General Protection Fault during boot in initrd phase

Asked by Sibidharan

I bought an i9-13900K and fitted it with 128 GB of RAM. I am running Ubuntu Server 23.04 with tens of VMs under KVM. I started experiencing random "general protection fault" kernel panics, all referring to some type of cross-cache permission violation, which I was able to fix by adding slub_debug=F to the kernel parameters, as suggested in https://access.redhat.com/solutions/2149041
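
To confirm that a parameter such as slub_debug=F is really active on the running kernel, the command line can be read back from /proc/cmdline. This is a minimal Python sketch, assuming a Linux system; the helper name parse_cmdline is hypothetical, and the parser does not handle quoted values that contain spaces.

```python
from pathlib import Path


def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' map to None. Quoted values containing spaces are
    not handled; that is a simplification for illustration.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


if __name__ == "__main__":
    cmdline_file = Path("/proc/cmdline")
    if cmdline_file.exists():  # only present on Linux
        params = parse_cmdline(cmdline_file.read_text())
        print("slub_debug =", params.get("slub_debug", "<not set>"))
```

The same check works for maxcpus or softlockup_panic by changing the key that is looked up.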

I tried booting from live USBs, and they just crashed, which was weird. The kernel is not tainted, but it crashes with the same type of permission violation in kmem_cache_alloc, and every live USB I booted, even with no hard disk attached, had the same issue. With some luck I was able to get my server to start, and since I had slub_debug=F on the kernel command line, it didn't crash during operation and ran for weeks at a time.

It worked for some time, until one day a power failure happened. When I restarted the server, it showed the error I attached here. The errors before slub_debug=F showed different addresses in the panic, but they were all page fault errors, and I was convinced they were due to freelist pointer corruption, as described by Red Hat support. I suspected faulty RAM, so I ran memtest, and it passed. This time, the error is the same across different kernels. I even tried to boot Windows from a new SSD; it couldn't boot, and I attached the BSOD here, which also points to a "general protection fault" raised by the processor.

But now it panics in the initrd phase, while the kernel is doing some udev work. I was never able to find what causes it, because the panic happens in the initrd phase and there is nowhere to write the logs. Interestingly, the same error in the same location now happens even if I boot different kernels from a live USB. I thought I had lost the server. I ran memtest, and it passed again. I removed each connected peripheral in turn and tested; nothing helped. Then I read somewhere to use maxcpus=1 to limit the number of CPUs, and it worked: my computer booted. It was up and running, but with only one CPU. I didn't know what was wrong until I did the same in the BIOS: I limited the performance cores to one and disabled all efficiency cores. I got 2 logical CPUs due to hyper-threading, and it works.
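
After limiting cores with maxcpus=1 or in the BIOS, the set of logical CPUs the kernel actually brought up can be read from /sys/devices/system/cpu/online, which holds a range list such as 0-1. A minimal Python sketch for expanding that list, assuming a Linux system; the helper name parse_cpu_list is hypothetical.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-1,4' into [0, 1, 4]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty list
        first, sep, last = part.partition("-")
        # A single number has no '-', so the range is first..first.
        cpus.extend(range(int(first), int(last if sep else first) + 1))
    return cpus


if __name__ == "__main__":
    online = Path("/sys/devices/system/cpu/online")
    if online.exists():  # only present on Linux
        print("online CPUs:", parse_cpu_list(online.read_text()))
```

With one performance core and hyper-threading enabled, the expected result would be two entries.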

I have read in a lot of places that this kind of failure points to CPU cores going faulty: https://access.redhat.com/solutions/3915511

Similar situation here: https://www.linuxquestions.org/questions/linux-desktop-74/not-present-page-kernel-panic-4175722803/

As described in the link above, after booting successfully with only one core I also tried enabling the remaining cores. But then the CPUs run into hard lockups or soft lockups. I even tried adding softlockup_panic=0 to the kernel parameters; it no longer panics then, but just hangs forever. It's a lockup: the CPU is not responding. In syslog and kern.log I see something like this, a permissions violation:

<code>
[2.911260] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[2.911260] BUG: unable to handle page fault for address: fffffe00000453a8
[2.911261] #PF: supervisor instruction fetch in kernel mode
[2.911261] #PF: error_code(0x0011) - permissions violation
[2.911262] PGD 87efc6067 P4D 87efc6067 PUD 87efc4067 PMD 87efc3067 PTE 000000085fc4d163
[2.911264] Thread overran stack, or stack corrupted
[2.911264] Oops: 0011:0xfffffc000000453a8
</code>
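
The error_code(0x0011) value in the log above can be decoded bit by bit. The following Python sketch uses the low five bits of the x86 page-fault error code (present/protection, write, user, reserved bit, instruction fetch); the table and function names are hypothetical, and higher bits such as protection keys are left out for brevity.

```python
# (mask, meaning when set, meaning when clear or None to stay silent)
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read or fetch"),
    (0x04, "user mode", "supervisor mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error(code):
    """Return human-readable meanings for a page-fault error code."""
    meanings = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            meanings.append(if_set)
        elif if_clear is not None:
            meanings.append(if_clear)
    return meanings


print(decode_pf_error(0x0011))
```

For 0x0011 this yields a protection violation on an instruction fetch in supervisor mode, which matches the two #PF lines in the log: the kernel tried to execute from a page that is mapped but not executable.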

How did I come to the conclusion that individual cores are faulty?

I rolled up my sleeves and went further: I enabled all efficiency cores and only one performance core, and the computer works normally. The kernel panic happens only if I enable the remaining performance cores, and it's the same error. I am now running fine with 17 cores and 18 logical CPUs. It's running amazingly well; I am able to boot a live USB and even run Windows.

What is wrong here? Has an individual performance core gone faulty? I haven't tried experimenting with the other performance cores yet, since my server is back on and I want it running. I will do that experiment eventually.
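
One possible way to run the per-core experiment from userspace is to pin the same deterministic computation to each logical CPU in turn and compare the results. This is only a rough Python sketch, assuming Linux (os.sched_setaffinity is not available elsewhere); the function names and the choice of a SHA-256 chain as workload are assumptions. A faulty core may well crash or lock up the machine instead of returning a wrong result, the reference value is computed on whichever core the process starts on, and a userspace loop does not exercise the kernel paths seen in the panics.

```python
import hashlib
import os


def digest(rounds):
    """Deterministic CPU-bound work: a chain of SHA-256 hashes."""
    data = b"core-check"
    for _ in range(rounds):
        data = hashlib.sha256(data).digest()
    return data


def check_cores(rounds=20000):
    """Run the same computation pinned to each allowed CPU in turn.

    Returns {cpu: bool}; False means the result differed from the
    reference computed before any pinning took place.
    """
    if not hasattr(os, "sched_setaffinity"):
        return {}  # affinity control is Linux-specific
    original = os.sched_getaffinity(0)
    reference = digest(rounds)
    results = {}
    try:
        for cpu in sorted(original):
            os.sched_setaffinity(0, {cpu})  # pin this process to one CPU
            results[cpu] = digest(rounds) == reference
    finally:
        os.sched_setaffinity(0, original)  # restore the original mask
    return results


if __name__ == "__main__":
    for cpu, ok in check_cores().items():
        print(f"cpu {cpu}: {'ok' if ok else 'MISMATCH'}")
```

Enabling one additional performance core at a time in the BIOS and repeating the run would narrow down which core, if any, misbehaves.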

Windows BSOD:
https://ibb.co/VMqTPLp
https://ibb.co/d5yKv4M

Panic in my server:
https://ibb.co/Y3Dnb84

Panics from different kernels via live USB:

https://ibb.co/BwxW3bw
https://ibb.co/y8mPVLB
https://ibb.co/HVwfBBP
https://ibb.co/4NJTbzS
https://ibb.co/KzFfHQh
https://ibb.co/x8WdzhJ
https://ibb.co/370Rfqb
https://ibb.co/svFbPF2

Question information

Language: English
Status: Open
For: Ubuntu
Assignee: No assignee
This question was reopened

Sibidharan (sibi1995) said :
#1

I just switched to a 14th-gen i9-14900K and all the issues are magically gone. The server boots up butter-smooth, with no panics and no lockups anywhere!!

It's the bloody i9-13900K; everyone (or a subset) who bought this is silently suffering.

Please change the CPU. That's the only solution.

