Also, fab companies have had to learn to be incredibly conservative about seemingly meaningless changes.

Most famously illustrated by Intel's "Copy Exactly!" methodology. https://duckduckgo.com/?q="copy%20exactly"+Intel

An adjacent IBM story that kinda explains why:

  During the year 1986, there was an anomalous increase in LSI memory problems. Electronics in early 1987 appeared to have problem rates approaching 20 times higher than predicted. In contrast, identical LSI memories being manufactured in Europe showed no anomalous problems. Because of knowledge of the radioactivity problem with the Intel 2107 RAMs, it was thought that the LSI package probably was at fault, since the IBM chips were mounted on similar ceramic materials. LSI ceramic packages made by IBM in Europe and in the U.S. were exchanged, but the European computer modules (with European chips and U.S. packaging) showed no fails, while the U.S. chips with European packages still failed at a high rate. This indicated that the problem was undoubtedly in the U.S.-manufactured LSI chips. In April 1987, significant design changes had been made to the memory chip with the most problems, a 4Kb bipolar RAM. The newer chip had been given the nickname Hera, and so at an early stage the incident became known as the "Hera problem."
  By June 1987, the problem was very serious. A group was organized to investigate the problem. The first breakthrough in understanding occurred with the analysis of "carcasses" from the memory chips (the term carcasses refers to the chips on an LSI wafer which do not work correctly, and are not used but saved in case some problem occurs at a future time). Some of these carcasses were shown to have significant radioactivity.
  Six weeks was spent in the manufacturing process lines, looking for radioactivity, and traces were found inside various processing units. However, it could not be determined whether these traces came from the raw materials used, or whether they were transferred from the chips themselves, which might have been contaminated earlier in their processing. Further, it was discovered that radioactive filaments (containing radioactive thorium) were commonly used in some evaporators. A detailed analysis by T. Zabel of some of the "hot" chips revealed that the radioactive contamination came from a single source: Po210. This isotope is found in the uranium decay chain, which contains about twelve different radioactive species. The surprising fact was that Po210 was the only contaminant on the LSI chips, and all the other expected decay-chain elements were missing. Hundreds of chips were analyzed for radioactivity, and Po210 contamination was found going back more than a year. Then it was found that whatever caused the radioactivity problem disappeared on all wafers started after May 22, 1987. After this precise date, all new wafers were free of contamination, except for small amounts which probably were contaminated by other older chips being processed by the same equipment. Since it takes about four months for chips to be manufactured, the pipeline was still full of "hot" chips in July and August 1987. Further sweeps of the manufacturing lines showed trace radioactivity, but the plant was essentially clean. The contamination had appeared in 1985, increased by more than 1000 times until May 22, 1987, and then totally disappeared!
  Several months passed, with widespread testing of manufacturing materials and tools, but no radioactive contamination was discovered. All memory chips in the manufacturing lines were spot-screened for radioactivity, but they were clean. The radioactivity reappeared in the manufacturing plant in early December 1987, mildly contaminating several hundred wafers, then disappeared again. A search of all the materials used in the fabrication of these chips found no source of the radioactivity. With further screening, and a lot of luck, a new and unused bottle of nitric acid was identified by J. Hannah as radioactive. One surprising aspect of this discovery was that, of twelve bottles in the single lot of acid, only one was contaminated. Since all screening of materials assumed lot-sized homogeneity, this discovery of a single bad sample in a large lot probably explained why previous scans of the manufacturing line had been negative. The unopened bottle of radioactive nitric acid led investigators back to a supplier's factory, and it was found that the radioactivity was being injected by a bottle-cleaning machine for semiconductor-grade acid bottles. This bottle cleaner used radioactive Po210 material to ionize an air jet which was used to dislodge electrostatic dust inside the bottles after washing. The jets were leaking radioactivity because of a change in the epoxy used to seal the Po210 inside the air jet capsule. Since these jets gave off infrequent and random bursts of radioactivity, only a few bottles out of thousands were contaminated.
An excerpt from: Ziegler, James F., et al. "IBM experiments in soft fails in computer electronics (1978–1994)." IBM Journal of Research and Development 40.1 (1996): 3-18.

Polonium is debuggable. More subtle statistical aberrations would be exponentially harder.

This story would make a killer Asianometry video.
CSI parody style?

I'm most familiar with software and home electronics debugging, but it would be wonderful to hear some stories from other disciplines where a culprit is found, and also about the forensic tools specific to other domains.
