I've been able to OCR letter-quality printer output to 97% accuracy (the remaining problems are mostly Os and Xs).
But it seems that machine-learning text-recognition is also now biased to reject computer code because it doesn't look like human language.
> With less-than-satisfactory OCR output, I resorted to a process I used many years ago when converting scans made of old Commodore ROM dumps printed on a Commodore 1515 dot-matrix printer. The process relies on the ASCII OCR output having the same repetitive errors. "B" and "8", "S" and "5" are good examples, as are "l" and "1", and "O" and "0". There are many other similar single-character errors and, when working with x86 code, there are similar errors with instructions like "MOV". This process naturally works better if the output file is monolithic rather than single-page OCR conversions because you can do substitutions across the entire converted printout and not 75 separate files.
> The next formatting hassle was the spacing. This required repetitive substitutions of a descending number of spaces to tabs (i.e., replace 8 spaces with a tab, then 7, 6, etc.). Then if you want to return it to fixed spaces (which is likely how the original printer printed it -- spaces and not vertical tabs), you can. For pure re-creation work, spaces produce absolute column formatting while tabs can move around depending on the program displaying the file.
> Once you run through the 15 or so common global substitutions and tab conversion, it's a lot easier to work with the file to fix formatting and perform other cleanup. This is then followed by a line-by-line comparison against the original printouts. Overall I'd say the conversion output quality with this method is very good.
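For anyone wanting to try the substitution pass described in the quote, here is a rough sketch in Python. It is not the quoted author's actual script. The tables `WORD_FIXES` and `DIGIT_FIXES` and the function names are made up for illustration, and the "mostly digits" rule is my own assumption about when a look-alike letter is safe to swap. Build the tables from whatever errors your OCR repeats.

```python
import re

# Hypothetical correction tables; fill these from the errors your OCR repeats.
WORD_FIXES = {"M0V": "MOV", "N0P": "NOP"}
DIGIT_FIXES = str.maketrans({"O": "0", "l": "1", "S": "5", "B": "8"})

def fix_words(text):
    """Replace whole-word misreads such as M0V -> MOV across the entire file."""
    for wrong, right in WORD_FIXES.items():
        text = re.sub(r"\b" + re.escape(wrong) + r"\b", right, text)
    return text

def fix_numeric_tokens(text):
    """Map look-alike letters to digits, but only inside mostly-numeric tokens."""
    def repair(match):
        token = match.group(0)
        digits = sum(ch.isdigit() for ch in token)
        # Assumption: a token is a number if real digits are the majority.
        if digits * 2 > len(token):
            return token.translate(DIGIT_FIXES)
        return token
    return re.sub(r"\b[0-9OlSB]+\b", repair, text)

def clean_ocr(text, tabsize=8):
    """Run the global substitutions, then expand tabs back to fixed columns."""
    text = fix_words(text)
    text = fix_numeric_tokens(text)
    # str.expandtabs pads each tab out to the next multiple of tabsize.
    return text.expandtabs(tabsize)

print(clean_ocr("M0V AX,5l2"))
```

Running it on one monolithic file, as the quote suggests, means each table entry is applied everywhere in a single pass. The line-by-line comparison against the printout is still needed afterwards.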
I've got a 4" stack of wide-carriage COBOL. I guess it's two revisions of the same system so I only need to scan the newer half. It's probably from a TI Omni 810.
On the other hand, I've got 100 pages of code printed in compressed font by someone wanting to make sure that 80+ char lines fit within margins. So a lot of words just don't come out at all. A frequent error is "A" becomes "H", "O" becomes "U" because the top dots aren't "attached".
And columns of line numbers starting with 0001, or hex? The most confounding thing is OCR that thinks 00 is a sideways 8, and that dominates the uniform block, so it tries to interpret the whole column as sideways text. In another situation, it interprets two stacked lines (each starting with 0) as one line starting with 8 and it just goes off the rails.
So I've been working with automatic skew correction, then clipping it into rows, in order to get each line of text isolated from the surrounding context. When I do that, I get better results, but it is not great either.
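The row-clipping step can be sketched roughly like this, assuming the page has already been deskewed and thresholded into a grid of 0/1 pixels (1 = ink). The function name `split_rows` and its parameters are hypothetical, and this is a plain horizontal projection profile, not necessarily what I ended up with in my own pipeline.

```python
def split_rows(bitmap, min_ink=1, min_height=1):
    """Return (top, bottom) row ranges, bottom exclusive, for each band with ink.

    bitmap is a list of rows, each row a list of 0/1 pixel values.
    """
    bands = []
    start = None
    for y, row in enumerate(bitmap):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y  # first inked row of a new text line
        elif not has_ink and start is not None:
            if y - start >= min_height:
                bands.append((start, y))
            start = None
    # Close a band that runs to the bottom edge of the page.
    if start is not None and len(bitmap) - start >= min_height:
        bands.append((start, len(bitmap)))
    return bands

page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
]
print(split_rows(page))
```

Each band can then be cropped out and fed to the recognizer on its own. Raising `min_ink` or `min_height` is one way to ignore specks, though the right values depend on the scan.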
I'm considering going all-in on training a new recognizer on snippets. For that, I'll be constructing "The Set of All As" and so on.
Maybe write them again?
For MS-DOS?
My work there was all new code and didn't involve any of that, however.
Finally, a sensible use case for BASIC's "READ" and "DATA" commands. Learning BASIC as a kid on a micro, it always struck me as an odd way to get input into a program. Sure, with INPUT, you'd have to hand-enter your input every time, but baking it into the program meant that you'd have to edit your program any time you wanted to change anything.
But with a card reader, you could "cut the deck". Keep the program cards, and then just stack on whatever set of data cards you wanted.
From this vantage point, in the 21st century with our flying cars and whatnot, it seems really quirky that back then, even your data could be a tangible thing.
...as writing paper.
1. you were using a DECwriter dot matrix printer as a terminal
2. using an ASR-33 teletype as a terminal
3. using punch cards or paper tape
4. using a glass tty that could only display 24 lines
5. when you did not have a remote terminal, and wanted to spread your code out on a table and debug it
Really depends on the program. Source code is often quite manageable. Even artifacts aren't always as large as you might expect. Busybox on my system weighs in at 1.9 MiB or alternatively 928 KiB with zstd maxed out.
But I don't really see a point to printing any of it. A situation that might require the printouts is likely to largely preclude the continued existence of modern electronics, the ability to replace batteries, or even a connection to a reliable electrical grid.
First of all, that comment is weirdly out of place. The quality and longevity of paper is not the topic.
Secondly, there are fragments of paper with writing as old as 2,000 years.
Thirdly, with paper you just look at it and see the writing. With digital documents, you need the technology to read the medium and then you need to know how the information was encoded onto the medium, before you even arrive at the same level as paper, where you can start to decipher the actual writing.
Paper has brought us where we are today, and given us what we know about the past. Don't be so ignorant and dismissive.
barely
It sounds like this printout has deteriorated badly and was barely readable.