Bug 1950764: Work Around Crash on Intel Raptor Lake CPU

(phabricator.services.mozilla.com)

149 points

by luu2 days ago

by bri3d14 hours ago

Linked in the Bugzilla thread is a really nice in-depth investigation of the same issue with high register aliases in a similar algorithm (Huffman coding) but in an entirely different product: https://fgiesen.wordpress.com/2025/05/21/oodle-2-9-14-and-in... .

It's concerning that Intel don't seem to have been responsive to anyone with respect to this issue and it doesn't appear to have an official erratum yet, although Raptor Lake was the Intel CPU with voltage issues and basically random bit rot, so I suppose it's hard to tell if this is a silicon-level erratum caused by bad design or by some kind of post-manufacturing damage. Raptor Lake in general causes enough non-reproducible noise that I believe Firefox gave up on automated crash reports from it ( https://bugzilla.mozilla.org/show_bug.cgi?id=1975808 ).

EDIT: I read that Oodle article (which is SO good!) again and realized that their customer-provided reproduction of the bug was directly linked to boost clock speeds (the customer said that overclocking by 5% made it happen entirely reliably), so this is definitely not a "the architecture has a 100% bug in it" but rather some deeper issue with clock propagation that appears at edge cases.

by m1329 hours ago

Read the Oodle article in full, fantastic investigation indeed!

It also looks like there's a slight difference in the unwanted effect both companies have reported, despite the bug being seemingly triggered the same way (mov touching the high byte):

- Oodle reports that a low byte is occasionally stored in the intended location.

- Mozilla's fix suggests that a full 16-bit value is stored instead, corrupting an adjacent variable! This could have much more serious consequences.

Technically, this could still be the same exact bug. I found no mention of the order the output buffer was accessed in by the Huffman decoder debugged in the Oodle report, and, since it was a contiguous buffer, it's easy to mistake an occasional out-of-bounds copy there for a copy from a wrong location. But if both analyses are correct, the behavior of high byte accesses on Raptor Lake is way less predictable than those fixes suggest. Haven't managed to find an official erratum from Intel.
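To see why the two symptoms could be one and the same bug, here is a toy model in Python. It is only a sketch: the function names, the flat `bytearray` standing in for memory, and the assumption that the faulty store writes the whole 16-bit register in little-endian order are mine, not taken from either report.

```python
def store_high_byte(mem: bytearray, addr: int, reg16: int) -> None:
    """Intended behaviour: store bits 8..15 of a 16-bit register at addr."""
    mem[addr] = (reg16 >> 8) & 0xFF


def faulty_low_byte_store(mem: bytearray, addr: int, reg16: int) -> None:
    """Symptom as described for Oodle: the low byte lands at the intended address."""
    mem[addr] = reg16 & 0xFF


def faulty_wide_store(mem: bytearray, addr: int, reg16: int) -> None:
    """Symptom as described for Mozilla: a 2-byte store that also hits addr + 1.

    Assumption: the whole register is written in little-endian order.
    """
    mem[addr] = reg16 & 0xFF
    mem[addr + 1] = (reg16 >> 8) & 0xFF


def run(store, reg16: int = 0x12AB) -> bytes:
    """Apply one store to a small buffer; 0xEE marks bytes that were not touched."""
    mem = bytearray([0xEE, 0xEE, 0xEE])
    store(mem, 1, reg16)
    return bytes(mem)


if __name__ == "__main__":
    for fn in (store_high_byte, faulty_low_byte_store, faulty_wide_store):
        print(fn.__name__, run(fn).hex())
```

In this model both faults leave the same wrong byte at the intended address. They differ only in whether the following byte is clobbered, which is easy to overlook in a contiguous output buffer.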

by mtlmtlmtlmtl11 hours ago

It's very interesting because my 13900K has worked like a dream from day one and still to this day. Never had any of the voltage issues, never had any abnormal crashes in Firefox or any other software. I was undervolting it for a long while, so I wonder if somehow that saved me from the voltage issues before they were fixed?

by Numerlor6 hours ago

Undervolting would definitely help, and is the actual fix. The current Intel fixes were mostly just for the symptoms, as the main issue is high voltage+power when pushing high clocks, but they can't actually fix that as it'd downgrade the advertised clocks the CPUs were sold with.

by anonymars4 hours ago

Sorry, but that understanding is dangerously incomplete. You're describing the first set of issues they uncovered, but there's also:

"Microcode and BIOS code requesting elevated core voltages which can cause Vmin shift especially during periods of idle and/or light activity" (emphasis mine)

https://community.intel.com/t5/Blogs/Tech-Innovation/Client/...

Recall also that "Vmin shift" means "the minimum voltage the processor needs to run correctly goes up" so if the issue isn't addressed, that level of undervolt may stop working

by Numerlor45 minutes ago

Not sure what's supposed to be wrong with that? The clock tree degrades at high voltage. Some theories I've seen were about the CPU requesting significantly higher voltages during alternating clocks when there's a short lull in load from e.g. a pipeline stall. There also doesn't seem to be a good enough sensor net in the right places for the CPU to react to this, so it just gradually "burns" itself down. Assuming these are true, an actual fix from Intel would mean either relaxing boost clocks to ones that are universally safe, opening themselves up to a lawsuit from everyone who bought the high-end SKUs, or doing a new stepping, which is extremely expensive for a finished design.

When the CPU degrades, it naturally needs higher voltages to be stable, up to the point where it breaks completely and no amount of voltage will help it. But if your CPU doesn't degrade because it hasn't been overdoing it on voltages, there'll be no Vmin shift to cause issues.

As an anecdotal experience from someone I know who runs these in prod for game servers: limiting the CPU to 80°C, 1.4V-1.45V and 400A has kept them alive for years under 24/7 load. Maybe go a bit lower on the voltage if you want to be sure longer term, as they are fine with just mass RMAing these. There are also large differences in silicon quality between samples: one can run cool and completely fine even at the old stock settings, while another has to pull, say, 1.5x the power for the same load and clocks, which makes it degrade.

by anonymars15 minutes ago

You're implying that if you don't run the CPU at high power and high heat it won't have problems, and that undervolting or underclocking will prevent damage. This is not correct: while that is helpful, Vmin degradation occurs during idle or light activity as well

Vmin will creep up, and the headroom for undervolting will degrade. It will affect the high clocks first (they demand the highest voltage), which is why dropping the max boost multiplier a step or two can also work around it (at the cost of basically downgrading it to a cheaper processor)

by Numerlor1 minutes ago

Idle and light load is bad for degradation only because that's the most common scenario where the boosting algorithm will actually go to the highest clocks. More loaded cores will have the CPU target lower clocks on all cores so that it actually can get the power for it and have the CPU be coolable, but if you're idle and then some task loads just a single core for a bit the CPU will boost it the highest it can. The voltage spikes from those boosts will cause local hotspots even if the CPU is cool overall.

by whizzter8 hours ago

I remember Puget Systems pointed to this same thing when they analyzed the issues back in the day when it was blowing up.

https://www.pugetsystems.com/blog/2024/08/02/puget-systems-p...

by nubinetwork6 hours ago

My 1360p and 13400 seem fine too. I applied the microcode and firmware updates when they came out... but I'm guessing it didn't affect all skus equally for whatever magical reason.

by nekzn6 hours ago

My 13900K was affected by the widespread voltage issue and had to be replaced, but since then I have had zero problems with it.

by Polizeiposaune14 hours ago

Details of the errata from a comment in the diff:

"Write both dist bytes as a single 2-byte store. This avoids the `movb %ch, [mem]` instruction pattern (store from high-byte register alias) that LLVM otherwise emits when dist arrives as a wide register. That pattern triggers the Intel Raptor Lake CPU errata, causing silent 2-byte stores that corrupt the adjacent `len` byte."
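A minimal sketch of the source-level idea in that comment, written in Python for brevity rather than in the language of the actual patch. The entry layout (two `dist` bytes followed by one `len` byte), the function names, and the little-endian byte order are assumptions for illustration. Python cannot show which machine instructions a compiler emits, so this only demonstrates that the two ways of writing `dist` produce the same bytes.

```python
import struct


def write_dist_two_stores(entry: bytearray, dist: int) -> None:
    """Write dist as two separate 1-byte stores (low byte, then bits 8..15).

    In compiled code the second store is the one that may be emitted as a
    store from a high-byte register alias.
    """
    entry[0] = dist & 0xFF
    entry[1] = (dist >> 8) & 0xFF


def write_dist_one_store(entry: bytearray, dist: int) -> None:
    """Write both dist bytes with a single 2-byte little-endian store."""
    struct.pack_into("<H", entry, 0, dist & 0xFFFF)


if __name__ == "__main__":
    # Hypothetical entry layout: [dist_lo, dist_hi, len]
    a = bytearray([0, 0, 7])
    b = bytearray([0, 0, 7])
    write_dist_two_stores(a, 0x1234)
    write_dist_one_store(b, 0x1234)
    print(a.hex(), b.hex())
```

Both functions leave the hypothetical `len` byte at index 2 untouched. The point of the real change is that the single wide store gives the compiler no reason to store from a high-byte register.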

by whadawha14 hours ago

How did this get past validation at Intel?

This is worse than https://en.wikipedia.org/wiki/Pentium_FDIV_bug

by bri3d14 hours ago

There's another blog post going into more depth about the issue here: https://fgiesen.wordpress.com/2025/05/21/oodle-2-9-14-and-in... where they speculate that it seems to relate to both other clock-related instability on specific Raptor Lake parts and possibly the overarching voltage control problems that the platform had early on. I can't tell entirely from the bug reports whether the behavior reliably reproduces on 100% of Raptor Lakes, but the indicators I'm reading suggest that it doesn't. It is concerning that Intel didn't get back to Mozilla about it though, since it's certainly a lot more than a one-off.

by userbinator13 hours ago

"validation? what validation?"

https://news.ycombinator.com/item?id=27244941

Edit: you should probably read the article I linked first.

by moffkalast8 hours ago

Common Raptor Lake L, add it to the pile of oxidation and overvoltage faults. This has to be the most faulty generation in Intel's entire recent manufacturing history.

by happycube3 hours ago

It's the Alder Lake MAX. Originally decent design just pushed too far.

by dmitrygr14 hours ago

modifying source to avoid an assembly instr isn't a fix... this needs a compiler fix most likely, or a microcode fix, if possible.

by whadawha13 hours ago

Anyone have knowledge of whether microcode can be patched on consumer grade Intel CPUs?

by bri3d13 hours ago

Yes? It is regularly; both the firmware and the OS can deliver updates depending on configuration. The Raptor Lake CPUs in question have gone through an enormous number of microcode revisions already due to quite famous voltage scaling issues; it's unclear if this erratum is fallout from or related to a similar root cause or just another issue with the processor.

by altairprime13 hours ago

https://github.com/intel/intel-linux-processor-microcode-dat...

  $ echo 1 > /sys/devices/system/cpu/microcode/reload

Hot-swappable, even. TIL!
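To confirm which revision is actually loaded before and after a reload, the value can be read back. A hedged sketch: it assumes a Linux x86 system whose `/proc/cpuinfo` contains `microcode : 0x...` lines, and the function names are made up for this example.

```python
import re
from typing import Set

# Matches lines such as "microcode\t: 0x12b" in /proc/cpuinfo (assumed format).
_MICROCODE_LINE = re.compile(r"^microcode\s*:\s*(0x[0-9a-fA-F]+)\s*$", re.MULTILINE)


def microcode_revisions(cpuinfo_text: str) -> Set[int]:
    """Return the distinct microcode revisions found in cpuinfo-style text."""
    return {int(m.group(1), 16) for m in _MICROCODE_LINE.finditer(cpuinfo_text)}


def read_microcode_revisions(path: str = "/proc/cpuinfo") -> Set[int]:
    """Read the file at path and extract the microcode revisions from it."""
    with open(path, encoding="ascii", errors="replace") as f:
        return microcode_revisions(f.read())


if __name__ == "__main__":
    print(sorted(hex(rev) for rev in read_microcode_revisions()))
```

More than one value in the result would mean that not every logical CPU is running the same revision.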

by xxs7 hours ago

Of course it's hot-swappable: the microcode is first deployed by the BIOS, then the OS can apply changes as well.

Just that it's writable by $ (not #) feels awkward.

by __patchbit__12 hours ago

At boot time, the following package provides the latest Intel CPU microcode data files on NetBSD.

  sysutils/intel-microcode-netbsd

dmesg shows

  cpu 0: ucode 0xf0->0xf6
  cpu 1: ucode 0xf0->0xf6

by throwaway20379 hours ago

Why is this downvoted? (At the time of writing, the text is grey, so it has at least a few downvotes.)

This is a good question. As others have noted below, yes, and sometimes you can see kernel logging on start-up when the microcode is loaded.

by wtallis3 hours ago

It's a dumb question, because it's in reply to a comment that already implies the answer, and it's trivial to find an answer online in less time than it takes to post that question and wait for someone to supply an answer.

The subject of CPU microcode update mechanisms is an interesting and relevant topic, but such a shallow, low-effort question is not a good way to promote interesting discussion on that topic.

by varispeed9 hours ago

[flagged]

by db48x30 minutes ago

It’s not elitism, it’s just self defense. Every forum starts out with high–value high–signal low–noise conversations and gradually decays towards low–signal high–noise conversations as new people are brought in. The new people are, by definition, new. They don’t know much if anything yet so they cannot participate meaningfully in advanced topics. Naturally they ask questions in order to fill in the gaps in their knowledge. It is simply unfortunate that the effect is to increase the amount of noise in the forum as each new member asks the same questions over and over again. This leads to the most knowledgeable members of the forum dropping out, as the quality of the discussion drops below the point where it is worth their time.

See also “Eternal September”.

by robin_reala13 hours ago

Also worth reading this thread on the subject: https://mas.to/@gabrielesvelto/116630047156991279

Regarding the Raptor Lake bug I received a couple of messages from confused users that had read articles on Tomshardware and Neowin. They asked about erratas and microcode updates which puzzled me, because that was part of my early investigation into the bug and we know that the failure is not caused by a known errata and microcode updates cannot fix broken CPUs. So why did they ask? As it turns out it was slop. Both articles are 100% slop full of confusing and inaccurate claims.

by websg-x9 hours ago

Because it's a known problem. It's called the Vmin Shift Instability issue. The affected CPUs are broken. One needs to RMA the CPUs. Intel also extended the CPU warranty by 2 more years.

Because there still are many broken CPUs out in the wild. Firefox works around the crash so the broken CPUs won't flood the channel with crash reports.

by samlinnfer6 hours ago

Sorry it's still not clear what he means? When a CPU is "broken", is it already failing or is it "broken" in the sense it will fail?

For example:

Does he mean all existing 13th/14th gen CPUs (prior to Intel's discovery of the Vmin issue) are broken in the sense that they are susceptible to damage and can only be replaced?

OR

Does he mean that the microcode updates, applied by Intel to existing CPUs that are susceptible to damage, will only slow degradation and the CPUs will eventually fail and can only be replaced?

OR

Is he saying the 13th/14th gen CPUs which have already sustained damage cannot be fixed by microcode updates?

by websg-x5 hours ago

Only the desktop 8p+16e-core parts (13600/700/900K, 14600/700/900K) are susceptible. The CPUs are safe with the fixed BIOS/microcode. The notebook versions of the 8p+16e CPUs are also safe from Vmin shift, since notebook computers cannot go insane with voltage.

If the CPU is damaged already, the new microcode won't fix the problem. It's broken. You have to RMA the CPU.

The Vmin shift instability is fixed. There have been no new reports of mass failures of 13th/14th gen CPUs since the new BIOS/microcode release.

by mike_hock14 hours ago

Uh ... working around this in each and every piece of software sounds like a non-starter? Intel should be on the hook to fix this.

by Polizeiposaune14 hours ago

Use of the "h" register slices (bits 8..15) by compilers is thankfully pretty rare -- otherwise this would have been noticed much sooner!

Agner Fog's optimization guide says "Any use of the high 8-bit registers AH, BH, CH, DH should be avoided because it can cause false dependences and less efficient code."
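For readers who have not met the "h" aliases: they name bits 8..15 of the register, and writing one leaves every other bit of the full register unchanged. A small Python model of that, with helper names invented for this sketch:

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def low8(reg: int) -> int:
    """Bits 0..7 of a register value (what CL is to RCX)."""
    return reg & 0xFF


def high8(reg: int) -> int:
    """Bits 8..15 of a register value (what CH is to RCX)."""
    return (reg >> 8) & 0xFF


def with_high8(reg: int, value: int) -> int:
    """Model a write to the high-byte alias: only bits 8..15 change."""
    return (reg & (MASK64 ^ 0xFF00)) | ((value & 0xFF) << 8)


if __name__ == "__main__":
    rcx = 0x1122334455667788
    print(hex(low8(rcx)), hex(high8(rcx)), hex(with_high8(rcx, 0xAB)))
```

That a write must be merged into the surrounding bits is what ties a high-byte write to the register's previous value, which is one source of the false dependences the quote mentions.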

by anarazel9 minutes ago

> Use of the "h" register slices (bits 8..15) by compilers is thankfully pretty rare -- otherwise this would have been noticed much sooner!

It's actually pretty easy to get compilers to use those; you mainly need a bunch of narrow accesses to neighboring memory. The Oodle post contains a godbolt link to pretty ordinary C code triggering this.

I'd guess that you also need some other conditions (multiple in flight stores, high boost speeds) to trigger this.

by userbinator14 hours ago

> Use of the "h" register slices (bits 8..15) by compilers is thankfully pretty rare

That's unfortunate, because it's precisely why things like this will keep happening.

> Agner Fog's optimization guide says "Any use of the high 8-bit registers AH, BH, CH, DH should be avoided because it can cause false dependences and less efficient code."

The sad vicious cycle of compilers not exercising the hardware, and then the hardware designers not paying attention. Using the high 8-bit registers and "implicitly merging" them is one of the ways to reduce the number of instructions and thus improve size optimisation.

by cesarb3 hours ago

> That's unfortunate, because it's precisely why things like this will keep happening.

I have the opposite opinion. Its use being rare means CPU designers have less need to optimize for that rare case, and hardware optimizations are precisely where these kinds of issues tend to pop up.

And high 8-bit registers are an x86-specific feature; other CPU families don't have them. So that special case being less optimized (or even pessimized) is not much of a loss.

by Polizeiposaune2 hours ago

Yep. The "high" registers as an alias for bits 8-15 of certain registers are one of many warts in the architecture; they should have been purged from 32-bit and 64-bit code, and left to rot in 16-bit mode only.

Intel blew it when they let them continue to work in 32-bit code on the 386, and then AMD blew it when they repeated the mistake when defining the 64-bit ISA.

by fuhsnn10 hours ago

> The sad vicious cycle of compilers not exercising the hardware

There could theoretically be instruction selection passes that are biased toward rare instructions, specialized for fuzzing hardware. I'm surprised Intel doesn't already do that.

by ahartmetz2 hours ago

Wait what, they are using Phabricator? I don't think it's particularly bad, I just thought it was... particularly dead.

by userbinator14 hours ago

WTF, Intel? This is reminding me of a very similar bug from 9 years ago: https://news.ycombinator.com/item?id=14630183

Clearly Intel needs to do far more extensive regression-testing, with things like demoscene productions --- especially the extremely size-optimised ones that can exercise the edge-cases much better than the usual "compiler slop".

by hsbauauvhabzb13 hours ago

Intel knowingly sold defective cpus and denied the defect until reports hit critical mass. I don’t think they care.

by userbinator13 hours ago

"knowingly" is meaningless, as otherwise they wouldn't even bother releasing errata lists; it's more likely that they underestimated the severity or their planned obsolescence calculations happened to be more statistically favourable than reality.

https://news.ycombinator.com/item?id=41041855

by close047 hours ago

> "knowingly" is meaningless

You’re intentionally muddying the waters with meaningless philosophy. Even the law makes the difference between “knowingly” (with knowledge, intention, premeditation) and “mistake”. They didn’t knowingly break the CPU but they knowingly launched it despite their own internal findings, and knowingly blamed others when this came out.

But here you are claiming that the company must deserve the benefit of the doubt on their intent.

No, “knowingly” is most definitely not “meaningless”. And anyone who’s not naive or ill-intentioned should make that distinction and take note. Every time a company gets away with it because an army of philosophers washes away any guilt or plays it down with meaningless distinctions, it becomes one more reason to do it again. Knowingly.

by hsbauauvhabzb12 hours ago

The link you provided doesn’t match your comment: one of the comments in that thread points out that Intel blamed motherboards during the early stages.

by moffkalast7 hours ago

Yes let's give a 50 billion dollar corporation flooded with MBAs the benefit of the doubt lmao. If GamersNexus can be believed, it was extremely deliberate to try and sweep it all under the rug.

I don't think they've clarified which exact models of which serial number ranges are affected with oxidation to this day. It should've been a recall.

by charcircuit13 hours ago

Hopefully this bug is getting handled upstream in a microcode update or a compiler fix to avoid emitting such instructions. Just a comment mentioning that you should not emit a particular instruction is not a strong guarantee.

by progval10 hours ago

According to https://bugzilla.mozilla.org/show_bug.cgi?id=1950764#c23 , it is not getting fixed.

by WaylandYang11 hours ago

[flagged]

by codedokode7 hours ago

I looked at the Raptor Lake errata [1] and it looks pretty scary. What if someone builds an exploit on these errors?

This is why CPU designers should aim for simplicity. This is why RISC-V vector extension, which requires complicated logic, can become a source of implementation errors.

[1] https://edc.intel.com/content/www/us/en/design/products/plat...

by adrian_b7 hours ago

All Intel, AMD or Arm-based CPUs and any other modern CPUs have dozens of errata, even if some CPU vendors keep them secret, instead of publishing them, as they should.

Fortunately, most of the erroneous behaviors are triggered only by very unlikely combinations of circumstances, some of which may even be impossible to happen in user programs, but only in operating system kernels.

Nevertheless, from time to time there are also serious errata, like the one discussed here, which can be triggered even by ordinary user programs. Sometimes, like here, such errata can be avoided by compilers patched to not generate the buggy instructions for the affected CPU models, assuming that it is known with certainty on which model of computer the compiled program will be executed (or using code dispatch at run time, based on the CPU model).
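The run-time dispatch idea could look roughly like the sketch below. Everything specific in it is an assumption: the `(family, model)` pairs are my guess at Raptor Lake identifiers and must be verified against Intel's documentation before use, the `/proc/cpuinfo` field names assume Linux on x86, and the two store functions are placeholders for a normal code path and a workaround code path.

```python
from typing import Callable, Dict, FrozenSet, Tuple

# ASSUMPTION: illustrative (family, model) pairs, not a vetted list.
ASSUMED_AFFECTED_MODELS: FrozenSet[Tuple[int, int]] = frozenset(
    {(6, 0xB7), (6, 0xBA), (6, 0xBF)}
)


def parse_family_model(cpuinfo_text: str) -> Tuple[int, int]:
    """Extract the first 'cpu family' and 'model' values from cpuinfo-style text."""
    fields: Dict[str, int] = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        name = key.strip()
        # "model name" is a different field and is deliberately not matched.
        if sep and name in ("cpu family", "model") and name not in fields:
            fields[name] = int(value.strip())
    return fields["cpu family"], fields["model"]


def store_default(buf: bytearray, dist: int) -> None:
    """Placeholder for the normal path: two separate 1-byte stores."""
    buf[0] = dist & 0xFF
    buf[1] = (dist >> 8) & 0xFF


def store_workaround(buf: bytearray, dist: int) -> None:
    """Placeholder for the workaround path: one 2-byte little-endian store."""
    buf[0:2] = (dist & 0xFFFF).to_bytes(2, "little")


def select_store(
    family: int,
    model: int,
    affected: FrozenSet[Tuple[int, int]] = ASSUMED_AFFECTED_MODELS,
) -> Callable[[bytearray, int], None]:
    """Pick an implementation once, based on the detected CPU model."""
    return store_workaround if (family, model) in affected else store_default
```

A program would call `select_store` once at start-up and keep the returned function, so the model check stays out of the hot path.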

Simplicity in the CPU hardware may reduce the probability of hardware bugs, but it increases the probability of software bugs, because the missing hardware features must be implemented at a much greater cost in software, like in the case with the missing integer overflow detection of RISC-V, which causes most RISC-V programs to omit overflow checks, increasing the chances of undetected bugs.
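As a rough illustration of what "implemented in software" means here, this is a sign-comparison overflow check for 32-bit signed addition, modelled in Python. The function names are invented for the sketch, and real compilers may emit a different instruction sequence.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def wrapping_add_i32(a: int, b: int) -> int:
    """Add two int32 values with two's-complement wraparound."""
    s = (a + b) & 0xFFFF_FFFF
    return s - 2**32 if s >= 2**31 else s


def checked_add_i32(a: int, b: int) -> int:
    """Add two int32 values, raising OverflowError if the result wrapped.

    Overflow happened exactly when both operands have the same sign and
    the wrapped result has the opposite sign.
    """
    s = wrapping_add_i32(a, b)
    if (a >= 0) == (b >= 0) and (s >= 0) != (a >= 0):
        raise OverflowError(f"int32 overflow in {a} + {b}")
    return s


if __name__ == "__main__":
    print(checked_add_i32(1, 2))
    print(wrapping_add_i32(INT32_MAX, 1))
```

The extra comparison after every addition is the software cost being argued about. An ISA with an overflow flag or a trapping add can fold that check into the add itself.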

by camel-cdr5 hours ago

> Simplicity in the CPU hardware may reduce the probability of hardware bugs, but it increases the probability of software bugs, because the missing hardware features must be implemented at a much greater cost in software, like in the case with the missing integer overflow detection of RISC-V, which causes most RISC-V programs to omit overflow checks, increasing the chances of undetected bugs.

Since I've got a SpacemiT K3 board myself now, I thought I'd test it again:

I compiled microjs with both tinycc and chibicc, which were both compiled for the target platform with and without -ftrapv:

    Slowdown Zen1: tinycc: 1.34%, chibicc: -0.3% (slight speedup somehow?)
    Slowdown X100: tinycc:  0.1%, chibicc:  3.4%

Last time I did full clang: https://news.ycombinator.com/item?id=47328214#47342362 And there was minimal slowdown (sometimes speedup) on x86, Arm and RISC-V. It was pointed out that llvm mostly uses size_t, however chibicc and tinycc use int as their default type, so there should be lots of overflow checking.

by codedokode5 hours ago

And Rust omits overflow checks under the same excuse, even though integer overflow has enabled multiple Linux kernel vulnerabilities.

by GoblinSlayer3 hours ago

Buffer overflows are caught by bound checks that don't need integer overflow checks, cf dotnet.
