by Oras 7 hours ago

by joss82 6 hours ago

I've been working on Parseur for the last 10 years, and OCR has not been solved yet, let me tell you.

OCR still sucks in 2026. Hopefully this might improve the situation but I haven't tested it yet.

by chpatrick 7 hours ago

It absolutely hasn't been solved, it's just got pretty decent in recent years.

by malfist 6 hours ago

Pretty decent might be quite the stretch. I'd term it almost acceptable, but only if you're using commercial solutions like Amazon's Textract. Doing it with open source tools is, at best, extremely painful and vaguely accurate.

by chpatrick 4 hours ago

PaddleOCR (also from Baidu) is pretty damn good actually.

by __rito__ 2 hours ago

I have shipped with PaddleOCR to prod. Works pretty well. (Usage limited to printed documents from the Anglosphere.) Runs fully offline, on CPU.

by gettingoverit 4 hours ago

Is it? I've never seen a single OCR that would replace a human just typing it by hand.

What if the goal is something actually useful, such as converting a scientific paper PDF back to LaTeX that renders into a pixel-perfect copy? What about converting tables from electronics datasheets into a computer-readable form? I wouldn't even expect it in the next decade.

by SyneRyder 3 hours ago

I've had success with vision models & OCR, saved me many hours / days / weeks of typing work.

Last year I finally OCR'd many hundreds of pages of my father's old writings. I found that feeding it to Claude Sonnet 4.x via API gave me results that were perfect. No corrections required. So perfect, that Claude was reading along with the story, and actually pointed out a continuity error in the story where an incorrect character was reciting dialog. Claude asked if it should transcribe exactly as is or if I would like Claude to correct the continuity error.

Claude also correctly OCR'd some handwriting that was in the margins of the documents. Sonnet came very close to transcribing a Word Sleuth puzzle, but that was where I hit the limits of its capability at the time.

Mistral OCR was also good (and actually what I started with), but it wasn't quite as good as Claude. And when it was wrong, Mistral could be frighteningly wrong - one API call must have failed, the model must have been presented with a pure black / null image, and I got back a "transcription" that described neverending darkness. It read like something the Woodsman would have broadcast in Twin Peaks S3E8. That poor model.

Tables from electronics datasheets might be okay; I think I've had success with OCR of technical manuals with tables for 80s synthesizer hardware. But I admit my use cases don't cross over into transcriptions of equations or graphs.

by sscaryterry 7 hours ago

Detecting characters: almost. Layout: no.

by wongarsu 6 hours ago

Exactly my experience. If you try to OCR hand-filled forms with a fixed structure, traditional OCR models are great. Vision LLMs can improve a bit on character recognition, but at the cost of harder-to-detect failure modes.

But if you are trying to ingest diverse documents with headings, multi-column layouts, headers and footers, ad space in the middle of your text, etc., vision LLMs are a giant step forward. You do need the context of the previous page to make good decisions about the current page, though, which is where things quickly get janky (or slow, if you choose the naive approach).

Vision LLMs also seem to deal much better with variance in scripts. Cursive, random Japanese in the middle of the text, weird math symbols, handwriting from three centuries ago: it all "just works" without you even having to remember that this can happen.
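To make the previous-page point concrete, here is a rough sketch of one way to carry context forward without resending every earlier page (which is one reading of the "naive approach"). Everything here is an assumption for illustration: `transcribe_page` is a hypothetical callable wrapping whatever vision model is in use, and `context_chars` is an arbitrary cap.

```python
from typing import Callable, Iterable

# Hypothetical signature: (page_image_bytes, context_text) -> transcribed_text.
Transcriber = Callable[[bytes, str], str]


def transcribe_document(pages: Iterable[bytes], transcribe_page: Transcriber,
                        context_chars: int = 500) -> list[str]:
    """Transcribe pages in order, passing only the tail of the previous page as context."""
    results: list[str] = []
    context = ""
    for image in pages:
        text = transcribe_page(image, context)
        results.append(text)
        # Keep a bounded tail so the context sent with each page stays small,
        # instead of growing with every page already processed.
        context = text[-context_chars:] if context_chars > 0 else ""
    return results


# Stand-in transcriber for demonstration; a real one would call a vision model.
def fake_transcriber(image: bytes, context: str) -> str:
    return f"[{len(context)} chars of context] {image.decode()}"


print(transcribe_document([b"page one", b"page two"], fake_transcriber, context_chars=10))
```

The trade-off is that a short tail can miss a heading or table header introduced several pages earlier, so the cap is a tuning knob rather than a fix.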

by ljouhet 6 hours ago

Real question: what tool do you use? (for long/complex documents with tables, code, maths)

- marker (with --force-ocr) gives me the best results

- Mistral OCR (seems really great, but I never managed to get it to work)

- Mathpix (tried a long time ago)

- docling (gives me garbage, I must be using it wrong)

- Unlimited OCR (will try it)

- ???

by Oras 6 hours ago

- Azure Document Intelligence (has an option to return markdown too including headers and footers).

- AWS Textract

by badlibrarian 5 hours ago

Exactly. They're both very expensive and prone to surprising you. Sometimes in a good way, sometimes in a bad way. I'd rate them 85%, but you have to run a test because they both fail in different ways on the 15%.
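A minimal sketch of what "run a test" could look like: score each engine's output against a hand-checked transcription using character error rate (Levenshtein edits per reference character). The engine names and sample strings below are made up for illustration; real output would come from the services themselves.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits needed per reference character."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical outputs from two engines on the same hand-checked page.
truth = "Invoice total: 1,250.00 EUR"
outputs = {
    "engine_a": "Invoice total: 1,250.00 EUR",
    "engine_b": "lnvoice total: 1.250,00 EUR",
}
for name, text in outputs.items():
    print(f"{name}: CER = {cer(truth, text):.3f}")
```

An aggregate score hides which documents fail, so it is worth keeping per-page numbers too, since the engines reportedly fail on different 15%.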

by ai_fry_ur_brain 5 hours ago

poma-ai has really great chunking techniques that chunk the document based on the document structure/hierarchy.

We use it on 200-page IEEE standards that are notoriously complex, filled with tables and diagrams. Highly recommend.

by vulture916 7 hours ago

I haven't done much long-run OCR, so unsure of the current state, but it would seem they overcome this (from their paper):

"A widely held view is that employing a large language model (LLM) as the decoder allows the model to leverage the prior distribution of language, leading to improved OCR performance. However, the downside is equally evident: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows down generation."
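For a sense of scale on the KV cache point in the quote: in a standard transformer decoder with an uncompressed cache, each generated token stores one key and one value vector per layer per KV head, so memory grows linearly with output length. The sketch below just does that arithmetic. The decoder shape is a hypothetical example, not taken from the paper.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Memory for keys and values cached across all layers for one sequence."""
    # Factor 2: one key vector and one value vector per token, per layer, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical decoder shape (assumption for illustration only), 16-bit values.
shape = dict(n_layers=24, n_kv_heads=8, head_dim=128)

for tokens in (1_000, 10_000, 100_000):
    mib = kv_cache_bytes(tokens, **shape) / 2**20
    print(f"{tokens:>7} tokens -> {mib:,.1f} MiB of KV cache")
```

Each decoding step also attends over everything cached so far, which is why generation gets slower, not just more memory-hungry, on long pages.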

by mamcx 4 hours ago

Aside: what is the best tool for reading receipts/bank statements/invoices?

by cannonpalms 7 hours ago

I guess, in theory, the prior distribution of language would allow for improved performance in some cases, especially where input quality is low.

by ta988 7 hours ago

This is already used in OCR; Tesseract does that.

by Aboutplants 6 hours ago

lol nope it hasn’t been solved. I deal with this constantly and we still have a longggg ways to go

by mschuster91 6 hours ago

> I would definitely understand post processing, like extracting data, answering question .. etc, but why re-doing the OCR engine itself?

Well... the idea seems to be (as far as I understand it, at least) that optical errors and artifacts can now be compensated for, because the OCR engine is now context-aware.

Say, for example, some random chemical with a long-ass name. It's not going to be in a word correction database, but a context-aware engine (ideally, one that has been supplemented with chemistry data) can now correct "bad" reads of the chemical's name.

Of course, there remains the issue of how to prevent the infamous Xerox bug [1]...

[1] https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
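As a much simpler stand-in for what a context-aware engine would do, the chemical-name case can be approximated by fuzzy-matching tokens against a domain word list. This is a sketch only: the vocabulary and the `correct_token` helper are hypothetical, and it uses plain string similarity from Python's standard `difflib`, not a language model.

```python
import difflib

# Hypothetical domain vocabulary; a real one would be far larger.
CHEMISTRY_TERMS = [
    "tetrahydrofuran",
    "dimethylformamide",
    "acetonitrile",
    "dichloromethane",
]


def correct_token(token: str, vocabulary: list[str], cutoff: float = 0.85) -> str:
    """Return the closest vocabulary entry if it is similar enough, else the token unchanged."""
    matches = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


print(correct_token("tetrahydrofuran", CHEMISTRY_TERMS))
print(correct_token("tetrahydr0furan", CHEMISTRY_TERMS))  # '0' misread for 'o'
print(correct_token("invoice", CHEMISTRY_TERMS))          # unrelated word is left alone
```

The `cutoff` is exactly where the Xerox-style risk lives: set it too low and a plausible but wrong term silently replaces what was actually on the page, so logging every substitution seems prudent.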

by ta988 7 hours ago

Cost, throughput, latency...

by Oras 7 hours ago

Traditional OCR is faster, cheaper, and much more reliable than LLMs.

by j16sdiz 6 hours ago

If you consider non-English scripts, traditional OCR is not more reliable.

CJK scripts have lots of characters and a high confusion rate.

Arabic scripts are complex and have lots of morphs.

Vietnamese has easily confused diacritics.

Thai has lots of non-standard fonts.

by ta988 7 hours ago

I don't think that's a universal statement that applies to every kind of document and language. Mistral OCR is able to do things no "traditional" OCR was ever able to.

by JodieBenitez 6 hours ago

I wish it were. Alas...

by JohnKemeny 7 hours ago

OCR has definitely not "been solved long time ago", what are you talking about?

In your opinion, what is SOTA here?