
There's a writeup here from one of the people on the team about the work it took to go from the listings to source code. http://cini.classiccmp.org/recoveryblog.htm

> With less-than-satisfactory OCR output, I resorted to a process I used many years ago when converting scans made of old Commodore ROM dumps printed on a Commodore 1515 dot-matrix printer. The process relies on the ASCII OCR output having the same repetitive errors. "B" and "8", "S" and "5" are good examples, as are "l" and "1", and "O" and "0". There are many other similar single-character errors and, when working with x86 code, there are similar errors with instructions like "MOV". This process naturally works better if the output file is monolithic rather than single-page OCR conversions because you can do substitutions across the entire converted printout and not 75 separate files.

> The next formatting hassle was the spacing. This required repetitive substitutions of a descending number of spaces to tabs (i.e., replace 8 spaces with a tab, then 7, 6, etc.). Then if you want to return it to fixed spaces (which is likely how the original printer printed it -- spaces and not vertical tabs), you can. For pure re-creation work, spaces produce absolute column formatting while tabs can move around depending on the program displaying the file.

> Once you run through the 15 or so common global substitutions and tab conversion, it's a lot easier to work with the file to fix formatting and perform other cleanup. This is then followed by a line-by-line comparison against the original printouts. Overall I'd say the conversion output quality with this method is very good.
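
For anyone who wants to try the same approach, here is a rough sketch of the two passes the quote describes: ordered global substitutions over one monolithic file, then the descending spaces-to-tabs pass. The rule list, `apply_rules`, and `spaces_to_tabs` are my own hypothetical names and examples, not the author's actual substitutions (the writeup doesn't list them).

```python
import re

# Hypothetical rule list: (regex, replacement), applied in order to the whole file.
# The writeup mentions about 15 such substitutions but does not enumerate them.
RULES = [
    (r"\bM0V\b", "MOV"),        # zero misread inside a mnemonic
    (r"\bP0P\b", "POP"),
    (r"(?<=\d)O(?=\d)", "0"),   # letter O sandwiched between digits
    (r"(?<=\d)l(?=\d)", "1"),   # lowercase L sandwiched between digits
]

def apply_rules(text, rules=RULES):
    """Run every substitution across the entire text, in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

def spaces_to_tabs(line, widest=8):
    """Replace runs of `widest` spaces with a tab, then shorter runs, down to 2."""
    for n in range(widest, 1, -1):
        line = line.replace(" " * n, "\t")
    return line

print(apply_rules("M0V AX,1O24"))
print(repr(spaces_to_tabs("MOV" + " " * 8 + "AX")))
```

Going back to fixed spaces afterwards is just `line.expandtabs(8)`. Note that the tab pass is lossy as written: it normalizes ragged OCR spacing onto tab stops rather than preserving the exact original columns, which is presumably why the line-by-line comparison still follows.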

by FarmerPotato 3 hours ago

Hmm, doesn't say anything about what OCR tools they used.

I've got a 4" stack of wide-carriage COBOL. I guess it's two revisions of the same system so I only need to scan the newer half. It's probably from a TI Omni 810.

On the other hand, I've got 100 pages of code printed in compressed font by someone wanting to make sure that 80+ char lines fit within margins. So a lot of words just don't come out at all. A frequent error is "A" becomes "H", "O" becomes "U" because the top dots aren't "attached".

And columns of line numbers starting with 0001, or hex? The most confounding thing is OCR that thinks 00 is a sideways 8, and that dominates the uniform block, so it tries to interpret the whole column as sideways text. In another situation, it interprets two stacked lines (each starting with 0) as one line starting with 8 and it just goes off the rails.

So I've been working with automatic skew correction, then clipping it into rows, in order to get each line of text isolated from the surrounding context. When I do that, I get better results, but it is not great either.
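
The row-clipping step can be sketched as a horizontal projection: count the ink in each pixel row and cut wherever a row is blank. This is only a minimal illustration, assuming the page is already deskewed and binarized into a list of rows of 0/1 values; `split_rows` and its parameters are hypothetical names, not from any particular tool.

```python
def split_rows(image, min_ink=1, min_height=1):
    """Return (top, bottom) pixel-row ranges that contain ink; bottom is exclusive.

    `image` is a list of rows, each a list of 0 (paper) or 1 (ink).
    """
    bands = []
    start = None
    for y, row in enumerate(image):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a text band begins here
        elif not has_ink and start is not None:
            if y - start >= min_height:
                bands.append((start, y))   # band ended on the previous row
            start = None
    # Close a band that runs to the bottom edge of the image.
    if start is not None and len(image) - start >= min_height:
        bands.append((start, len(image)))
    return bands

page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
print(split_rows(page))
```

With dot-matrix output there can be blank pixel rows between the dot rows of a single character, so in practice a band merger or a larger `min_ink`/`min_height` would probably be needed; that tuning is left out here.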

I'm considering going all-in on training a new recognizer on snippets. For that, I'll be constructing "The Set of All As" and so on.

by accrual 2 hours ago

Pretty interesting. I wonder if a whitelist against certain columns in the output could help, e.g. this column can only contain valid x86 instructions (e.g. MOV is allowed, M0V is not), this column can only contain hexadecimal (1 is allowed but never "l"), etc. Probably more work than it's worth given the final line-by-line comparison that happens anyway.
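
A minimal sketch of that whitelist idea, assuming a made-up listing layout where the first whitespace-separated field is a hex address and the second is the mnemonic. The layout, the tiny `MNEMONICS` set, and `flag_suspects` are all hypothetical; a real listing would need its own column rules.

```python
import re

HEX_RE = re.compile(r"[0-9A-Fa-f]+")

# Hypothetical, deliberately tiny subset of x86 mnemonics.
MNEMONICS = {"MOV", "ADD", "SUB", "JMP", "CALL", "RET", "PUSH", "POP", "INT"}

def flag_suspects(lines):
    """Return (line_number, column_name, token) for every token failing its whitelist."""
    suspects = []
    for no, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) < 2:
            continue  # not enough columns to check
        addr, mnemonic = fields[0], fields[1]
        if not HEX_RE.fullmatch(addr):
            suspects.append((no, "address", addr))
        if mnemonic.upper() not in MNEMONICS:
            suspects.append((no, "mnemonic", mnemonic))
    return suspects

listing = ["0100 MOV AX,BX", "01l0 M0V AX,BX", "0104 RET"]
for hit in flag_suspects(listing):
    print(hit)
```

This only flags lines for a human to look at rather than auto-correcting them, which fits with the manual comparison pass happening anyway.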

by embedding-shape 9 hours ago

Boring reply perhaps, but I've had wild success with adding even a tiny LLM afterwards to do "fixups" over OCR'd text. It works great for the typical O/0 issues and similar: just pass it the scrambled OCR'd text together with the text around it, and even dumb, tiny 7B models running on CPU do a pretty fine job.
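
Roughly, the plumbing looks like the sketch below. It is not tied to any particular model or library: `model` is assumed to be any callable that takes a prompt string and returns a string, and `build_prompt`, `fixup`, the prompt wording, and the length guard are all hypothetical choices for illustration.

```python
def build_prompt(lines, index, context=2):
    """Wrap one OCR'd line with up to `context` neighbouring lines on each side."""
    lo = max(0, index - context)
    hi = min(len(lines), index + context + 1)
    before = "\n".join(lines[lo:index])
    after = "\n".join(lines[index + 1:hi])
    return (
        "Fix OCR errors in the TARGET line only. Reply with the corrected line.\n"
        f"BEFORE:\n{before}\nTARGET:\n{lines[index]}\nAFTER:\n{after}\n"
    )

def fixup(lines, model, context=2, max_len_change=2):
    """Ask `model` to correct each line; keep the original if the reply looks off."""
    fixed = []
    for i, original in enumerate(lines):
        reply = model(build_prompt(lines, i, context)).strip("\r\n")
        # Guard (an assumption): OCR fixes should barely change the line length.
        if reply and abs(len(reply) - len(original)) <= max_len_change:
            fixed.append(reply)
        else:
            fixed.append(original)
    return fixed

def fake_model(prompt):
    """Stand-in for a real model so the sketch runs without one."""
    target = prompt.split("TARGET:\n")[1].split("\nAFTER:")[0]
    return target.replace("M0V", "MOV")

print(fixup(["M0V AX,1", "RET"], fake_model))
```

The length guard is there because small models sometimes reply with chatter instead of just the corrected line; anything that changes the line length by more than a couple of characters is discarded and the original kept.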

by bob778 8 hours ago

ABBYY has a specific module for dot matrix printouts, so I’m surprised it was a struggle for them, but every document is different.