Hmm, doesn't say anything about what OCR tools they used.

I've got a 4" stack of wide-carriage COBOL. I guess it's two revisions of the same system so I only need to scan the newer half. It's probably from a TI Omni 810.

On the other hand, I've got 100 pages of code printed in compressed font by someone wanting to make sure that 80+ char lines fit within margins. So a lot of words just don't come out at all. A frequent error is "A" becomes "H", "O" becomes "U" because the top dots aren't "attached".

And columns of line numbers starting with 0001, or hex? The most confounding thing is OCR that thinks 00 is a sideways 8, and that dominates the uniform block, so it tries to interpret the whole column as sideways text. In another situation, it interprets two stacked lines (each starting with 0) as one line starting with 8 and it just goes off the rails.

So I've been working with automatic skew correction, then clipping the page into rows, in order to get each line of text isolated from the surrounding context. When I do that, I get better results, but they're still not great.
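
A minimal sketch of the row-clipping step, assuming the page is already deskewed and binarized into a grid of 0 (paper) and 1 (ink). It uses a horizontal projection profile, which is one common way to find text lines and not necessarily what any particular OCR tool does. The names `split_rows` and `crop_rows` and the `min_ink`, `min_height`, and `pad` parameters are made up for illustration.

```python
def split_rows(image, min_ink=1, min_height=1):
    """Find horizontal bands of ink in a binarized page.

    image: list of pixel rows, each a list of 0 (paper) or 1 (ink).
    Returns (top, bottom) pairs, with bottom exclusive.
    """
    # Ink count per pixel row; blank gaps between text lines sum to ~0.
    profile = [sum(row) for row in image]
    bands = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # entering a text line
        elif ink < min_ink and start is not None:
            if y - start >= min_height:    # drop specks shorter than min_height
                bands.append((start, y))
            start = None
    # A band that runs to the bottom edge of the page.
    if start is not None and len(image) - start >= min_height:
        bands.append((start, len(image)))
    return bands


def crop_rows(image, bands, pad=1):
    """Cut each band out of the page, keeping `pad` rows of margin."""
    crops = []
    for top, bottom in bands:
        lo = max(0, top - pad)
        hi = min(len(image), bottom + pad)
        crops.append(image[lo:hi])
    return crops
```

Each crop can then be fed to the recognizer on its own, so a column of leading zeros never gets seen as one tall block. Raising `min_ink` helps when dot-matrix noise fills the gaps between lines, at the risk of splitting a faint line in two.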

I'm considering going all-in on training a new recognizer on snippets. For that, I'll be constructing "The Set of All As" and so on.

Pretty interesting. I wonder if a whitelist against certain columns in the output could help, e.g. this column can only contain valid x86 instructions (e.g. MOV is allowed, M0V is not), this column can only contain hexadecimal (1 is allowed but never "l"), etc. Probably more work than it's worth given the final line-by-line comparison that happens anyway.
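
A rough sketch of what that per-column whitelist might look like, assuming the OCR output has already been split into fields. The look-alike tables and the tiny `MNEMONICS` set are illustrative guesses, not a complete list, and `fix_hex`, `fix_mnemonic`, and `check_line` are hypothetical helper names.

```python
import re

# Assumed look-alike pairs; a real table would come from the observed errors.
TO_DIGIT = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})
TO_LETTER = str.maketrans({"0": "O", "1": "I", "5": "S"})

HEX = re.compile(r"[0-9A-F]+")
# Tiny illustrative subset, not the full x86 instruction set.
MNEMONICS = {"MOV", "ADD", "SUB", "JMP", "CMP", "PUSH", "POP"}


def fix_hex(token):
    """Return the token as valid hex after look-alike fixes, else None."""
    candidate = token.translate(TO_DIGIT).upper()
    return candidate if HEX.fullmatch(candidate) else None


def fix_mnemonic(token):
    """Return a whitelisted mnemonic, trying digit-to-letter fixes, else None."""
    candidate = token.upper()
    if candidate in MNEMONICS:
        return candidate
    candidate = candidate.translate(TO_LETTER)
    return candidate if candidate in MNEMONICS else None


def check_line(fields, validators):
    """Apply one validator per column; None marks a field needing review."""
    return [fix(field) for field, fix in zip(fields, validators)]


if __name__ == "__main__":
    # Column 0 is a hex address, column 1 is a mnemonic.
    print(check_line(["l0A4", "M0V"], [fix_hex, fix_mnemonic]))
```

Fields that come back as `None` are the ones worth flagging for the manual line-by-line comparison, rather than silently guessing.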