Tesseract does not understand layout. It's fine for character recognition, but if I still have to pipe the output to an LLM to make sense of the layout and fix common transcription errors, I might as well use a single model. It's also easier for a visual LLM to extract figures and tables in one pass.

by chaps 8 hours ago

For my workflows, layout extraction has been so inconsistent that I've stopped attempting to use it. It's simpler to just throw everything into PostGIS and run intersection checks on size-normalized pages.
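To make the idea concrete, here is a minimal sketch of an intersection check on size-normalized pages, written in plain Python rather than PostGIS (where `ST_Intersects` or `ST_Intersection` would presumably do the geometry). The `Box` type, the function names, and the sample coordinates are all hypothetical; the commenter's actual schema and queries are not shown in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle: (x0, y0) top-left, (x1, y1) bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float


def normalize(box: Box, page_width: float, page_height: float) -> Box:
    """Scale a box to the unit square so pages of different sizes compare."""
    return Box(box.x0 / page_width, box.y0 / page_height,
               box.x1 / page_width, box.y1 / page_height)


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two boxes, or 0.0 if they do not overlap."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return w * h if w > 0 and h > 0 else 0.0


def overlap_fraction(word: Box, region: Box) -> float:
    """Fraction of the word box that falls inside the region."""
    area = (word.x1 - word.x0) * (word.y1 - word.y0)
    return intersection_area(word, region) / area if area > 0 else 0.0


# Hypothetical data: a form field defined on a 612x792 pt template,
# and an OCR word box taken from a 2550x3300 px scan of the same form.
region = normalize(Box(72, 72, 306, 144), 612, 792)
word = normalize(Box(400, 350, 900, 450), 2550, 3300)
elsewhere = normalize(Box(2000, 3000, 2400, 3100), 2550, 3300)

print(overlap_fraction(word, region))       # word lies inside the field
print(overlap_fraction(elsewhere, region))  # word lies outside the field
```

Normalizing first means the same field region can be reused across scans of different resolutions, which is the point of doing the check on size-normalized pages.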

by kergonath 7 hours ago

Interesting. What kind of layout do you have?

My documents have one- or two-column layouts, often inconsistently across pages or even within a page (which tripped up older layout detection methods). Most models seem to handle that well enough for my use case.

by chaps 7 hours ago

Documents that come from FOIA. So, some scanned, some not. Lots of forms, and lots of handwriting added to capture info that the form format doesn't accommodate. Lots of repeated documents, but also lots of one-off documents that have high signal.

by pogue 5 minutes ago

I'd be very curious what works well with historical FOIA documents that have been scanned by hand and redacted with markers, etc.

by fudged716 hours ago|


I don't know how, but PyMuPDF4LLM is based on Tesseract and has GNN-based layout detection.

by chaps 8 hours ago

When Tesseract v4 was released, it was exceptionally good and blew everything else out of the water. I have used it to OCR millions of pages. Tbh, I miss the simplicity of Tesseract.

The new models are a similar step up from Tesseract v4. But don't expect them to be a panacea for your OCR problems. The edge cases you might be trying to solve (like identifying anchor points, or identifying shared field names across documents) are pretty much all still problematic. So you should still expect things like random spaces or unexpected characters to jam up your jams.

Also, some newer models tend to hallucinate incredibly aggressively. If you've ever seen an LLM get stuck in an infinite loop, think of that.
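As an illustration of the "random spaces or unexpected characters" problem when matching shared field names across documents, here is a rough sketch using only the Python standard library. The function names, the sample labels, and the 0.8 cutoff are assumptions for illustration, not something from the workflow described above.

```python
import difflib
import re
import unicodedata


def normalize_label(text: str) -> str:
    """Fold Unicode variants, lowercase, and drop everything but a-z and 0-9."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"[^a-z0-9]", "", text)


def match_field(ocr_label: str, known_fields: list[str], cutoff: float = 0.8):
    """Return the known field name closest to a noisy OCR label, or None."""
    keys = {normalize_label(name): name for name in known_fields}
    hits = difflib.get_close_matches(
        normalize_label(ocr_label), list(keys), n=1, cutoff=cutoff
    )
    return keys[hits[0]] if hits else None


# Hypothetical field names and OCR output with a stray space and a 1-for-i swap.
fields = ["Date of Birth", "Case Number"]
print(match_field("D ate of B1rth:", fields))
print(match_field("Signature", fields))
```

Stripping whitespace and punctuation before comparing handles the stray spaces, and the fuzzy match absorbs single-character substitutions. It will not help with hallucinated text, which looks plausible and so passes this kind of check.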

by notnullorvoid 1 hour ago

I used Tesseract v3 back in the day in combination with some custom layout parsing code. It ended up working quite well. When I look at many of the models coming out today, the lack of accuracy scares me.
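For a sense of what custom layout parsing on top of OCR word boxes can look like, here is a deliberately naive sketch that splits words into columns by horizontal gaps and then reads each column top to bottom. It is not the code referred to above; the `Word` type, the `min_gap` threshold, and the sample coordinates are hypothetical, and it assumes words on the same line share the same `y0`, which real OCR output rarely guarantees.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def split_columns(words: list[Word], min_gap: float = 20.0) -> list[list[Word]]:
    """Group words into columns by merging horizontally close extents."""
    columns: list[tuple[float, list[Word]]] = []  # (right edge, words)
    for w in sorted(words, key=lambda w: w.x0):
        if columns and w.x0 - columns[-1][0] < min_gap:
            right, members = columns[-1]
            columns[-1] = (max(right, w.x1), members + [w])
        else:
            columns.append((w.x1, [w]))
    return [members for _, members in columns]


def reading_order(words: list[Word], min_gap: float = 20.0) -> list[str]:
    """Read columns left to right, and each column top to bottom."""
    ordered: list[str] = []
    for column in split_columns(words, min_gap):
        ordered.extend(w.text for w in sorted(column, key=lambda w: (w.y0, w.x0)))
    return ordered


# Hypothetical two-column page, with words listed out of order.
words = [
    Word("right1", 320, 10, 380, 22),
    Word("left1", 10, 10, 60, 22),
    Word("left2", 10, 30, 70, 42),
    Word("right2", 320, 30, 390, 42),
    Word("left1b", 65, 10, 120, 22),
]
print(reading_order(words))
```

A full-width heading that spans both columns would bridge the gap and collapse everything into one column, which is the kind of mixed layout that needs more than a gap heuristic.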
