First: this is indeed a real bug in the AGC software. However, it did not go unnoticed for the whole program. It was discovered during level 3 testing of SATANCHE, a late development branch of the Command Module software COMANCHE. It was assigned anomaly number L-1D-02, and was fixed between Apollo 14 and 15. There are two known surviving copies of the L-1D-02 anomaly report:
* https://www.ibiblio.org/apollo/Documents/contents_of_luminar...
* https://www.ibiblio.org/apollo/Documents/contents_of_luminar...
The fix described in the article is partially complete, but as noted in the anomaly report there's a little bit more to it. Rather than just adding the two instructions to zero LGYRO, they restructured the code a bit and also made it wake up pending jobs. You can compare the relevant sections of the Apollo 14 and Apollo 15 LM software here:
* Apollo 14: https://github.com/virtualagc/virtualagc/blob/master/Luminar...
* Apollo 15: https://github.com/virtualagc/virtualagc/blob/master/Luminar...
The bug would not manifest silently in the way described in the article. For starters, LGYRO is also zeroed in STARTSB2, which is executed via GOPROG2 on any major program change: https://github.com/virtualagc/virtualagc/blob/master/Luminar...
This means that changing from any program to any other program would immediately resolve the issue. This is almost certainly a large part of why it took them so long to notice. Hitting BADEND while actively pulse-torquing is quite rare, and avoided by normal procedure. The scenario presented in the article can't happen since the act of starting P52 will zero LGYRO.
Moreover, in the very specific scenarios in which the bug can be triggered and remain, it results in multiple jobs stacking up attempting to torque the gyros. Eventually the computer runs out of space for new jobs -- similar to what happened on 11 -- and a 31202 (the Apollo 12+ equivalent of 1202) is triggered.
Since the issue was found before the flight of Apollo 14, a further description of how it might occur and what the recovery procedure should be was added to the Apollo 14 Program Notes: https://www.ibiblio.org/apollo/Documents/LUM159_text.pdf#pag...
Some other notes:
> Ken Shirriff has analysed it down to individual gates
I've done the bulk of the gate-level analysis. :)
> the Virtual AGC project runs the software in emulation, having confirmed the recovered source byte-for-byte against the original core rope dumps.
We've only been able to do that in very specific circumstances and only for subsections of assorted programs, but never for a full program. Most AGC software either comes from a program listing, from a core rope dump, or from reconstruction using changelogs and known memory bank checksums. We've disassembled all of the rope dumps into source files that assemble back into the same binary, but the comments and labels will be different from what was in the original listing. And to be extra clear: I've never had the opportunity to dump a module containing Apollo 11 software for either vehicle. Our sole source for both programs is a pair of printouts in the MIT Museum's collection.
> Margaret Hamilton (as “rope mother” for LUMINARY) approved the final flight programs before they were woven into core rope memory.
Jim Kernan was the rope mother for Luminary at least up through Apollo 11. Margaret was the rope mother for Comanche, the CM software, and was later promoted to lead the software division. Their positions at the time of 11 can be seen on this org chart: https://www.ibiblio.org/apollo/Documents/ApolloOrg-1969-02.p...
> Their priority scheduling saved the Apollo 11 landing when the 1202 alarms fired during descent, shedding low-priority tasks under load exactly as designed.
This is a huge topic on its own, but the AGC software was not designed to shed low-priority jobs. Ironically, the lowest priority job during the landing was the landing guidance itself, with high-priority jobs being reserved for things that needed quick response like antenna movements or display updates. If the computer were to shed the lowest-priority jobs, it would shed the landing guidance. This memo contains a list of all jobs active during the landing and their priorities: https://www.ibiblio.org/apollo/Documents/CherryApollo11Exege...
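To make that concrete, here is a toy sketch in Rust; the job names follow the comment above, but the priority values are invented placeholders, not the actual figures from the memo:

    // Toy illustration only: the priorities are made-up stand-ins.
    struct Job {
        name: &'static str,
        priority: u8, // higher number = higher priority
    }

    fn main() {
        let mut jobs = vec![
            Job { name: "SERVICER (landing guidance)", priority: 20 },
            Job { name: "display update", priority: 25 },
            Job { name: "antenna repositioning", priority: 30 },
        ];
        // A naive "shed the lowest-priority job under overload" policy...
        jobs.sort_by_key(|j| j.priority);
        let shed = jobs.remove(0);
        // ...would discard the landing guidance itself.
        println!("shed under overload: {}", shed.name);
    }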
> For example, the ICD for the rendezvous radar specified that two 800 Hz power supplies would be frequency-locked but said nothing about phase synchronisation. The resulting phase drift made the antenna appear to dither, generating roughly 6,400 spurious interrupts per second per angle and consuming roughly 13% of the computer’s capacity during Apollo 11’s descent. This was the underlying cause of the 1202 alarms.
The frequency-lock prevents phase drift, so the phase is essentially fixed once the power supplies are up. Ironically, however, the bigger issue is that one reference was 28V while the other was 15V. Initial testing on actual Apollo hardware suggests that at least for Apollo 11, this voltage difference was the key contributor rather than the phase difference: https://www.youtube.com/watch?v=dT33c70EIYk
Your experimental work on the voltage difference is fascinating, I appreciate you sharing the link to the demo. There's something about seeing results come off real hardware that you just don't get from computer simulation. Watching your setup brought back some of the excitement I felt driving my Makerbot via a stylus input 15 years ago (probably the last time I seriously engaged with hardware) [1]. Thanks!
(I mention this so more people can know the list exists, and hopefully email us more nominations)
The anomaly report pdf is quite long. The relevant pages for the bug report are 51 and 52: https://www.ibiblio.org/apollo/Documents/contents_of_luminar...
One of the more interesting things they have been working on is a potential re-interpretation of the infamous 1202 alarm. It is, as of this writing, popularly described as something related to nonsensical readings from a sensor which could be (and were) safely ignored during the actual Moon landing. However, if I remember correctly, some of their investigation revealed that there were actually many conditions under which that error would have been extremely critical and would likely have doomed the astronauts. It is super fascinating.
Death being a layer of aluminum away changes your mind.
I don't think the numbers you quoted are outliers, though. The F-100 lost ~900 out of 2,300. The F-106 lost ~120/342. That's a pretty big list of planes with a 1/5-1/3 loss rate.
There was a fuel tank mounted between the engine and the cockpit, so if it took enough of a hit to puncture right through (not hard, in practice), the failure mode was that the cockpit was now full of a 350 mph jet of burning petrol.
Still, it did the job.
For complex reasons, the available CPU time during the landing was lower than expected (it was being stolen by the radar-pointing peripheral). This caused a regularly scheduled job to spawn before the previous instance had finished. This had two effects: job instances were suspended mid-routine by new instances, and the piling up of old instances eventually exhausted resources and caused the kernel to panic and reboot. Rebooting during landing sounds scary, but it was actually fine: such critical tasks were specifically designed to restart automatically from previously saved checkpoint data in memory.
What was more dangerous was the suspension of tasks before the restarts occurred. First, it meant the routine wasn't executing to the end, which in actual flight caused blanked displays (as updating the display was the last thing the routine did). Had any more CPU time been stolen, the routine could have been interrupted even earlier, e.g. before it sent the engine commands.
Another issue is that under fluctuating load, new instances could actually run to the end, and then a previously suspended job instance could be resumed, potentially sending stale data to the displays and engine.
And finally, while each job instance had its own core set and VAC area properly managed by the kernel (think of it as a modern kernel switching between task stacks), that particular routine wasn't designed to be reentrant. So it used various global variables ("erasables") for its own purposes, which, if the routine was interrupted in an unlucky place, might have caused very bad behavior.
How likely all of the above was to occur depends on the exact profile of the fluctuating load caused by the confused radar peripheral. I guess that's why Mike Stewart is trying to replicate these issues with a real CDU.
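A very loose sketch of that pile-up mechanism (not AGC code; the 7 core sets are real, but the 3-in-4 completion ratio is invented for illustration):

    // Jobs spawn on a fixed schedule; each needs a free "core set" slot.
    // When the slots run out, the executive raises a program alarm and
    // the software restarts from checkpointed state.
    const CORE_SETS: usize = 7; // the AGC executive had 7 core sets

    fn main() {
        let mut in_flight = 0usize;
        for cycle in 1.. {
            in_flight += 1; // a new instance of the scheduled job spawns
            if cycle % 4 != 0 {
                in_flight -= 1; // this instance ran to completion in time
            }
            // every 4th instance is left suspended, so they slowly pile up
            if in_flight > CORE_SETS {
                println!("cycle {cycle}: no core sets left -> alarm + software restart");
                break;
            }
        }
    }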
TC BANKCALL # TEMPORARY, I HOPE HOPE HOPE
CADR STOPRATE # TEMPORARY, I HOPE HOPE HOPE
TC DOWNFLAG # PERMIT X-AXIS OVERRIDE
https://github.com/chrislgarry/Apollo-11/blob/master/Luminar...
CADR is an AGC assembly directive defining a "complete address", including a memory bank; in this case it names a subroutine to be called by the preceding BANKCALL (TC = transfer control, i.e., store the return address and jump to the subroutine). BANKCALL switches to the memory bank specified in the CADR before jumping to the address specified in the CADR.
For a brief explanation of AGC subroutine calls, see [1].
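As a very loose model of that calling pattern (an illustration, not real AGC behavior; among other things it glosses over how BANKCALL saves and restores Q and FBANK, and the bank/address values are hypothetical):

    // A CADR packs a fixed-memory bank number and an in-bank address.
    #[derive(Clone, Copy)]
    struct Cadr {
        bank: u8,
        addr: u16,
    }

    struct Cpu {
        fbank: u8, // fixed-bank register: selects the ROM bank mapped in
        q: u16,    // return-address register, written by TC
    }

    impl Cpu {
        // TC BANKCALL saves the return address in Q and jumps to BANKCALL,
        // which reads the CADR word after the call site, switches FBANK to
        // the CADR's bank, and transfers control to the CADR's address.
        fn bankcall(&mut self, return_addr: u16, target: Cadr) -> (u8, u16) {
            self.q = return_addr;
            self.fbank = target.bank;
            (target.bank, target.addr) // where execution continues
        }
    }

    fn main() {
        let mut cpu = Cpu { fbank: 0, q: 0 };
        let stoprate = Cadr { bank: 16, addr: 0o2000 }; // hypothetical location
        let (bank, addr) = cpu.bankcall(0o4321, stoprate);
        println!("FBANK={}, Q={:o}: control at bank {}, address {:o}", cpu.fbank, cpu.q, bank, addr);
    }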
CAR and CDR in Lisp come from the original implementation on the IBM 704, where pointers to the two components of a cons cell were stored as the (C)ontents of the (A)ddress and (D)ecrement fields of a (R)egister (memory word).
(CADR x) is just shorthand for (CAR (CDR x)), i.e., a function that returns the second element of a list (assuming x is a well-formed list).
[1] https://epizodsspace.airbase.ru/bibl/inostr-yazyki/American_...
“I tried to keep to Shelley’s unusual (and non-standard) rhyme scheme for the sonnet, but I departed from it in the second-to-last line for poetic reasons. For a language which excels in stealing words from other cultures, English has an appalling lack of rhymes.”
Perhaps with deeper analysis, and a few choice new words, this issue could be remedied.
Although that's a paradoxically tedious engineering solution to improving a language's beauty.
From another angle: how come other languages are more poetic? Are they older, having had more time to evolve toward poetry? Or were the speakers who wrought the language just more poetic?
One of AI's strengths is definitely exploration, e.g. in finding bugs, but it still has a high false positive rate. Depending on the context, that matters or it won't.
Also, one has to be aware that there are a lot of bugs that AI won't find but humans would.
I don’t have the expertise to verify this bug actually happened, but I’m curious.
But I do think their explanation of the lock acquisition and the failure scenario is quite clear and compelling.
https://github.com/juxt/Apollo-11/tree/master/specs
The specs there have many thousands of lines of code in them.
Anyways, it seems it would take a dedicated professional serious work to determine whether this bug is real. And considering this looks like an ad for their business, I would be skeptical.
Could the "AI native language" they used be Apache Drools? The "when" syntax reminded me of it...
https://kie.apache.org/docs/10.0.x/drools/drools/language-re...
(Apache Drools is an open source rule language and interpreter to declaratively formulate and execute rule-based specifications; it easily integrates with Java code.)
> We found this defect by distilling a behavioural specification of the IMU subsystem using Allium, an AI-native behavioural specification language.
The intro says “We used Claude and Allium”. Allium looks like a tool they’ve built for Claude.
So the article is about how they used their AI tooling and workflow to find the bug.
They used their AI tool to extract the rules for the Apollo guidance system based on the source code.
Then they used Claude to check if all paths followed those rules.
It's not even clear you read the article
Please, keep your offensive comments to yourself when a clarifying comment might have sufficed.
> We found this defect by distilling a behavioural specification of the IMU subsystem using Allium, an AI-native behavioural specification language.
2nd paragraph starts with: "We used Claude and Allium"
And later on: "With that obligation written down, Claude traced every path that runs after gyros_busy is set to true"
A.k.a. fabricating. No wonder they chose to use "AI".
A guarded switch, no less.
But personally I'm trying to be more generous about this sort of thing: it is very very difficult to explain subtle bugs like this to non-technical people. If you don't give them a story for how it can actually happen, they tend to just assume it's not real. But then when you tell a nice story, all us dry aged curmudgeons tut tut about how irreverent and over the top it is :)
Finding the middle ground between a dry technical analysis and dramatization can be really hard when your audience is the entire internet.
The repro runs on my computer; that's a positive.
However, Phase 5 (deadlock demonstration) is entirely faked. The script just prints what it _thinks_ would happen. It doesn't actually use the emulator to prove that its thinking is right. Classic Claude being lazy (and the vibe coder not verifying).
I've vibe coded a fix so that the demonstration is actually done properly on the emulator. And also added verification that the 2 line patch actually fixes the bug: https://github.com/juxt/agc-lgyro-lock-leak-bug/pull/1
I see this a lot in AI slop, which I mostly get exposed to in the form of shitty pull requests.
You know when you're trying to explain Test-Driven Development to people and you want to explain how you write the simplest thing that passes the test and then improve the test, right? So you say "I want a routine that adds VAT onto a price, so I write a test that says £20+VAT is £24, and the simplest thing that can pass that test is just returning 24". Now you know and I know that the routine and its test will break if you feed it any value except £20, but we've proved we can write a routine and its test, and now we can make it more general.
Or maybe we don't care and we slap a big TODO: make this actually work on there because we don't need it to work properly now, we've got other things to do first, and every price coming up as £20+VAT is a useful indicator that we still have to make other bits work. It doesn't matter.
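In code (Rust here as a stand-in, since the example above names no language), that intermediate stage looks like:

    /// Deliberately wrong first cut: hard-codes the single tested case.
    /// TODO: actually compute 20% VAT once the rest of the pipeline works.
    fn price_with_vat(_net: f64) -> f64 {
        24.0
    }

    fn main() {
        println!("£20 + VAT = £{}", price_with_vat(20.0));
    }

    #[cfg(test)]
    mod tests {
        use super::*;

        #[test]
        fn twenty_pounds_plus_vat_is_twenty_four() {
            assert_eq!(price_with_vat(20.0), 24.0);
        }
    }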
The problem is that AI slop code "generators" will just stop at that point and go "THERE LOOK IT'S DONE AND IT'S PERFECT!" and the people who believe in the usefulness of AI will just ship it.
If anything, if you try to cram a ton of complexity into a few kb of memory, the likelihood of introducing bugs becomes very high.
Oh dear. I strongly suggest this author look "specification" up in a dictionary.
Nor did they make any mistakes when they described how they produced a specification (and indeed, that it is a specification), despite your insinuation otherwise, for a similar reason.
Maybe instead of pointing towards dictionaries, stop pretending that you lack reading comprehension, and get off your high horse, please.
[1] In the repo, the "reproduce" is just a bunch of print statements about what would happen, the bug isn't actually triggered: https://github.com/juxt/agc-lgyro-lock-leak-bug/blob/c378438...
Ughhhh… I know this is probably legit here, but reading these words make me lose interest sooo fast these days…
Rust specifically does not forbid deadlocks, including deadlocks caused by resource leaks. There are many ways in safe Rust to deliberately leak memory: either by creating reference-count cycles, or via the explicit leak methods on various memory-allocating structures in std. It's also not entirely useless to do this; if you want a &'static reference from heap memory, Box::leak does exactly that.
Now, that being said, actually writing code to hold a MutexGuard forever is difficult, but that's mainly because the Rust type system is incomplete in ways that primarily inconvenience programmers but don't compromise the safety or meaning of programs. The borrow checker runs separately from type checking, so there's no way to represent a type that both owns a lock and holds its guard at the same time. Only stacks and async types, both generated by compiler magic, can own a MutexGuard. You would have to spawn a thread and have it hold the lock and loop indefinitely[0].
[0] Panicking in the thread does not deadlock the lock. Rust's std locks are designed to mark themselves as poisoned if a MutexGuard is unwound by a panic, and any attempt to lock them will yield an error instead of deadlocking. You can, of course, clear the poison condition in safe Rust if you are willing to recover from potentially inconsistent data half-written by a panicked thread. Most people just unwrap the lock error, though.
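A minimal sketch of both points, leaking in safe Rust and poisoning-instead-of-deadlock:

    use std::sync::{Arc, Mutex};
    use std::thread;

    fn main() {
        // Leaking in safe Rust: Box::leak trades the allocation for a &'static.
        let forever: &'static str = Box::leak(String::from("never freed").into_boxed_str());
        println!("{forever}");

        // Poisoning: a panic while the guard is held marks the Mutex poisoned,
        // so later lock() calls return Err instead of deadlocking.
        let lock = Arc::new(Mutex::new(0i32));
        let lock2 = Arc::clone(&lock);
        let _ = thread::spawn(move || {
            let _guard = lock2.lock().unwrap();
            panic!("unwinds while holding the lock");
        })
        .join(); // the child's panic surfaces here as an Err we discard

        match lock.lock() {
            Ok(guard) => println!("clean: {}", *guard),
            // Clearing the poison means accepting possibly half-written state.
            Err(poisoned) => println!("poisoned; recovered: {}", *poisoned.into_inner()),
        }
    }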
>The Apollo Guidance Computer (AGC) is one of the most scrutinised codebases in history.
What? The AGC programs were developed by a relatively small team and pretty much left alone since then. The architecture is rather quirky when viewed with modern sensibilities. There aren't many people who are familiar with it. Compare it to widely used software like libcurl or sqlite. Or perhaps to Super Mario Bros, which has been extensively analyzed for competitive speedrunning reasons. Surely that dwarfs the amount of knowledge about the Apollo code.
>2K of erasable RAM and a 1MHz clock. The AGC’s programs were stored in 74KB of core rope
How about picking a unit and staying with it? The AGC has 2K words of RAM, where each word has 15 bits of usable data (physically it's 16 bits, but one bit is used for parity). The maximum amount of ROM that could be installed is 36K words. (But they switch to KB, which is not only inconsistent with the previous sentence, the number is also wrong! It's 72 KiB / 73.728 KB or 67.5 KiB / 69.12 KB, depending on whether you include the parity bits or not.) (A maximum of 64K ROM words could be addressed by the architecture's design, but that isn't available in any real hardware.)
And yes, there is a 1.024 MHz clock in the system, which is relevant for peripherals, but you probably want to know how fast it executed instructions. One memory cycle takes 11.71875 μs (85 1/3 kHz), and most instructions take 2 such cycles (one for the operation, a second for fetching the next instruction). (Each memory cycle is long enough for a read from ROM, or a read and write to RAM. ROM speed was the limiting factor; by the standards of core memories it wasn't particularly fast. The AGS backup computer used core for both RAM and ROM and had a memory cycle time of 5 μs.) (In case you are confused, "core memory" and "core rope memory" refer to quite different things!)
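For anyone who wants to check those figures:

    \begin{align*}
    36864 \text{ words} \times 16 \text{ bits} &= 589824 \text{ bits} = 73728 \text{ bytes} = 72\ \mathrm{KiB} = 73.728\ \mathrm{KB},\\
    36864 \text{ words} \times 15 \text{ bits} &= 552960 \text{ bits} = 69120 \text{ bytes} = 67.5\ \mathrm{KiB} = 69.12\ \mathrm{KB},\\
    \text{instruction rate} &\approx \frac{1}{2 \times 11.71875\ \mu\mathrm{s}} \approx 42667 \text{ instructions/s}.
    \end{align*}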
If you think I'm nitpicking, try writing an emulator and wondering why you have to sift through all that slop. You could just give the correct numbers, you know?
>“My secret terror for the last six months has been leaving them on the Moon and returning to Earth alone”, Collins later wrote of the rendezvous. A dead gyro system behind the Moon, with Armstrong and Aldrin on the surface waiting for a rendezvous burn that depends on a platform he can no longer align, is exactly that scenario. A hard reset would have cleared it. But the 1202 alarms during the lunar descent had been stressful enough with Mission Control on the line and Steve Bales making a snap abort-or-continue call. Behind the Moon, alone, with a computer that was accepting commands and doing nothing, Collins would have had to make that call by himself.
You know what an orbit is? That it goes around? That you could just wait for a while and speak with Mission Control? What even is this scenario? That your guidance system failed, and for some inexplicable reason you are considering immediately heading back to Earth right now, leaving your pals behind? (With a manual burn, I guess, since guidance is dead?) You just wait for contact with Houston and tell them what happened. They pore over the program listings and find the bug. They radio you back the appropriate VERB and NOUN commands for poking the right values into memory. The End. And besides, spacecraft can be tracked and their orbits determined from Earth, so even if the PGNCS did fail completely, the LM would just get the necessary orbit information from Mission Control. (Also, if guidance fails in either the LM or the CM, either one can take the active role during rendezvous. And the LM has an extra backup system, the previously mentioned AGS.)
The whole framing of "we found a minor deadlock bug in an AGC program, what a shock!" is bizarre. It's not a small program. If you have any experience with software, of course you know it has bugs! They iterated on the software, releasing new versions for most missions, adding new features, and fixing bugs they found. What a concept!
It seems the difference between this and conventional specification languages is that Allium's specs are in natural language, and enforcement is by LLM. This places it in a middle ground between unstructured plan files and formal specification languages. I can see this as a low-friction way to improve code quality.
I don’t mind that they let an LLM write the text, but they should at least have edited it.
Another one: "Two instructions are missing: [...] Four bytes."
One more: "The defensive coding hid the problem, but it didn’t eliminate it."
This insistence that certain stylistic patterns are "tell-tale" signs that an article was written by AI makes no sense, particularly when you consider that whatever stylistic tics an LLM may possess are a result of it being trained on human writing.
My hunch that this is substantially LLM-generated is based on more than that.
In my head it's like a Bayesian classifier: you look at all the sentences and judge whether each is more or less likely to be LLM- vs human-generated. Then you add prior information, like the fact that the author did the research using Claude, which increases the likelihood that they also used Claude for the writing.
Maybe your detector just isn't so sensitive (yet), or maybe I'm wrong, but I have pretty high confidence that at least 10% of the sentences were LLM-generated.
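As a toy version of that mental arithmetic (the per-sentence likelihood ratios below are invented, purely to show the mechanics):

    // Accumulate log-odds: prior + one log likelihood ratio per sentence.
    fn main() {
        // Prior odds 2:1 that an LLM was involved (e.g. because the author
        // already did the research with Claude); the value is illustrative.
        let prior_log_odds = (2.0f64).ln();

        // Per-sentence P(sentence | LLM) / P(sentence | human), invented.
        let sentence_lrs = [3.0f64, 0.8, 1.5, 2.2, 0.9];

        let posterior_log_odds: f64 =
            prior_log_odds + sentence_lrs.iter().map(|lr| lr.ln()).sum::<f64>();

        let p_llm = 1.0 / (1.0 + (-posterior_log_odds).exp());
        println!("posterior P(LLM-written) = {p_llm:.2}");
    }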
Yes, the stylistic patterns exist in human speech, but RLHF has increased their frequency. Also, LLM writing has a certain monotonicity that human writing often lacks. Which is not surprising: the machine generates more or less the most likely text in an algorithmic manner. Humans don't. They write a few sentences, then get a coffee, sleep, write a few more. That creates more variety than an LLM can.
Fun exercise: https://en.wikipedia.org/wiki/Wikipedia:AI_or_not_quiz
Someone probably expended a lot of time and effort planning, thinking about, and writing an interesting article, and then you stroll by and casually accuse them of being a bone idle cheat, with no supporting evidence other than your "sensitive detector" and a bunch of hand-wavy nonsense that adds up to naught.
More importantly, it's an article about using Claude from a company about using Claude. I think on the balance it's very likely that they would use Claude to write their technical blog posts.
Your job doesn't require you to think or expend effort?
I also hate this style of plastic, pre-digested prose. Its soulless and uninteresting. Maybe I've just read too much AI slop. I associate this writing style with low quality, uninteresting junk.
If there is constant vigilance on the part of the reader as to how it was created, meaning and value become secondary, a sure path to the death of reading as a joy.
For what it’s worth, Pangram reports that Marcus’ article is 100% LLM-written: https://www.pangram.com/history/640288b9-e16b-4f76-a730-8000...
73% judged GPT-4.5 (edit: had incorrectly said 4o before) to be the human.
https://arxiv.org/abs/2503.23674
Not only are people bad at judging this, but are directionally wrong.
> Our experiments show that annotators who frequently use LLMs for writing tasks excel at detecting AI-generated text, even without any specialized training or feedback. In fact, the majority vote among five such “expert” annotators misclassifies only 1 of 300 articles, significantly outperforming most commercial and open-source detectors we evaluated even in the presence of evasion tactics like paraphrasing and humanization.
Even though they are perfect for writing down thoughts and notes.
“An em dash … they’re a witch!”… “it’s not just X, it’s Y… they’re a witch!”
That's a strawman alright; all the comments complaining about how they can't use their own writing style without being ganged up on are at positive karma from my angle, so I'm not sure the "positive social reactions" are really aligned with your imagination. Or does it only count when it aligns with your persecution complex?
In fact, the latter is the opposite of terseness. LLMs love to tell you what things are not way more than people do.
See https://www.blakestockton.com/dont-write-like-ai-1-101-negat...
(The irony that I started with "it's not just" isn't lost on me)
But an LLM wouldn't write "It's not just X, it's the Y and Z". No disrespect to your writing intended, but adding that extra clause adds just the slightest bit of natural slack to the flow of the sentence, whereas everything LLMs generate comes out like marketing copy that's trying to be as punchy and cloying as possible at all times.
It’s becoming a problem in schools as teachers start accusing students of cheating based on these detectors or ignore obvious signs of AI use because the detectors don’t trigger on it.
Not sure how I feel about the whole "LLMs learned from human texts, so now the people who helped write human texts are suddenly accused of plagiarizing LLMs" thing yet, but seems backwards so far and like a low quality criticism.
> The specification forces this question on every path through the IMU mode-switching code. A reviewer examining BADEND would see correct, complete cleanup for every resource BADEND was designed to handle.
> The specification approaches from the other direction: starting from LGYRO and asking whether any paths fail to clear it.
> *Tests verify the code as written; a behavioural specification asks what the code is for.*
However this is a blog post about using Claude for XYZ, from an AI company whose tagline is
"AI-assisted engineering that unlocks your organization's potential"
Do you really think they spent the time required to actually write a good article by hand? My guess is that they are unlocking their own organization's potential by having Claude write the posts.
Given that I've been familiar with Juxt since before, have used plenty of their Clojure libraries in the past, and hung out with people from Juxt even before LLMs were a thing: yes, I do think they could have spent the time required to both research and write articles like these. Again, I won't claim to know for sure how they wrote this specific article, but I'm familiar enough with Juxt to feel relatively confident they could have written it.
Juxt is more of a consultancy shop than an "AI company"; not sure where you got that from. I guess their landing page isn't 100% clear about what they actually do, but they're at least prominent in the Clojure ecosystem and have been for a decade, if not more.
Don't understand how these tools exist.
They found that Pangram suffers from false positives in non-prose contexts like bibliographies, outlines, formatting, etc. The article does not touch on Pangram’s false negatives.
I personally think it’s an intractable problem, but I do feel pangram gives some useful signal, albeit not reliably.
What's making it even more difficult to tell now is people who use AI a lot seem to be actively picking up some of its vocab and writing style quirks.
It seems to look at sections of ~300 words. And for one section at least it has low confidence.
I tested it by getting ChatGPT to add a paragraph to one of my sister comments. Result is "100% human" when in fact it's only 75% human.
Pangram test result: https://www.pangram.com/history/1ee3ce96-6ae5-4de7-9d91-5846...
ChatGPT session where it added a paragraph that Pangram misses: https://chatgpt.com/share/69d4faff-1e18-8329-84fa-6c86fc8258...
A Note on the Process
To be clear about what happened here: Claude wrote this article.
https://www.juxt.pro/blog/what-we-learned-from-34-clojure-in...
I therefore decided not to use any LLM for blogging again, and even though it takes a lot more time without one (I'm not a very motivated writer), I prefer to release something that I wrote myself rather than some LLM output that I wouldn't read myself.
It is:
- sneering
- a shallow dismissal (please address the content)
- curmudgeonly
- a tangential annoyance
All things explicitly discouraged in the site guidelines. [1]
Downvoting is the tool for items that you think don't belong on the front page. We don't need the same comment on every single article.
> Don't post generated comments or AI-edited comments. HN is for conversation between humans.
The same principle applies to submissions. If you couldn't be bothered to write it, don't ask me to read it. HN is for humans.
You can't downvote submissions. That's literally not a feature of the site. You can only flag submissions, if you have more than 31 karma.
Optimistically, I guess I can call myself some sort of live-and-let-live person.
Consider that by submitting AI generated content for humans to read, the statement you're making is "I did not consider this worth my time to write, but I believe it's worth your time to read, because your time is worth less than mine". It's an inherently arrogant and unbalanced exchange.
Note: the guidelines are a living document that contain references to current AI tools.
> Consider that by submitting AI generated content for humans to read, the statement you're making is "I did not consider this worth my time to write, but I believe it's worth your time to read, because your time is worth less than mine". It's an inherently arrogant and unbalanced exchange.
This is something worth saying about pure slop content. But the "charge" against the current item is that a reader got the feeling that an LLM was involved in the production of interesting content.
With enough eyeballs, all prose contains LLM tells.
We don't need to be told every time someone's personal AI detection algorithm flags. It's a cookie-banner comment: no new information for the reader, but a frustratingly predictable obstacle to scroll through.
But they won't do that, because deep down they feel shameful about it (as they should).
It seems like almost every discussion has at least someone complaining about "AI slop" in either the original post or the comments.
Seeing comments warning about the AI content of a link is helpful to let others know what they’re getting into when they click the link.
For this article the accusations are not about slop (which will waste your time) but about tell-tale signs of AI tone. The content is interesting, but you know someone has been doing heavy AI polishing, which gives articles a laborious tone and has a tendency to produce a lot of words around a smaller amount of content (in other words, you're reading an AI expansion of someone's smaller prompt, which contained the original info you're interested in).
Being able to share this information is important when discussing links. I find it much more helpful than the comments that appear criticizing color schemes, font choices, or that the page doesn’t work with JavaScript disabled.
This got me thinking: what if LLMs are used to do the opposite? To condense a long prompt into a short article? That takes more work but might make the outcome more enjoyable as it contains more information.
You're fighting an uphill battle against the inherent tendency to produce more and longer text. There's also the regression to the mean problem, so you get less information (and more generic) even though the text is shorter.
Basically, it doesn't work
> Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something.
> Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
Speaking of the HN guidelines, they also say this:
> Don't post generated comments or AI-edited comments. HN is for conversation between humans.
>> Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something.
>> Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
They don't. people. tangential.
There is some real content in the haystack, but we almost need some kind of curator to find and display it rather than a vote system where most people vote on the title alone.
There might be a market for your alternative though. Should be easy enough to build with Claude Code.
By asking AI to write the article for you, you're asserting that the subject matter is not interesting enough to be worth your time to write, so why would it be worth my time to read?
Sure, let me have a look.
He wrote 8 similarly lengthy blog posts in just 2 months:
* https://www.juxt.pro/blog/from-specification-to-stress-test/
* https://www.juxt.pro/blog/three-paradoxes/
* https://www.juxt.pro/blog/what-outlasts-the-code/
* https://www.juxt.pro/blog/composition-at-a-distance/
* https://www.juxt.pro/blog/new-vocabulary-for-an-old-problem/
* https://www.juxt.pro/blog/softwares-second-heroic-age/
* https://www.juxt.pro/blog/capability-hyperinflation/
They contain a lot of classic LLMisms:
"Implementation is the shrinking currency. Not because it’s worthless, but because supply is exploding."
His past writing was much, much less wordy: https://henrygarner.com/
The short sentence construction is the most suspicious, but I actually don't see anything glaring. It normally jumps out and hits me in the face.
1. Use Short Sentences
Who gives a crap if it was written by an LLM. Read it or don’t read it. Your choice.
If it conveys the idea and you learn something new, then it's mission accomplished.
What a horrible world we live in where the author of great writing like this has to sit and be accused of "being AI slop" simply because they use grammar and rhetoric well.
If an LLM wrote that, then I no longer oppose LLM art.