If you want OCR with the big LLM providers, you should probably be passing one page per request. Having the model focus on OCR for only a single page at a time seemed to help a lot in my anecdotal testing a few months ago. You can even pass all the pages in parallel as separate requests, and get the better-quality responses much faster too.
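A minimal sketch of the one-page-per-request, parallel approach. `ocr_page` here is a hypothetical stand-in for whatever single-page LLM API call you'd actually make:

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_page(page_number, page_bytes):
    # Hypothetical stand-in for a single-page LLM OCR request
    # (one image or one-page PDF per API call).
    return f"text of page {page_number}"

def ocr_document(pages):
    # One request per page, issued in parallel; results come back
    # in page order because executor.map preserves input order.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda args: ocr_page(*args), enumerate(pages)))

print(ocr_document([b"p0", b"p1", b"p2"]))
```

The per-page outputs can then be concatenated in order, so the parallelism doesn't scramble the document.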

But, as others said, if you can't afford mistakes, then you're going to need a human in the loop to take responsibility.

reply
Gemini 3 Pro seems to be built for handling multi-page PDFs.

I can feed it a multi-page PDF and tell it to convert it to markdown, and it does this well. I don't need to load the pages one at a time as long as I use the PDF format. (This was tested in AI Studio, but I think the API works the same way.)

reply
It's not that they can't do multiple pages... but did you compare against doing one page at a time?

How many pages did you try in a single request? 5? 50? 500?

I fully believe that 5 pages of input works just fine, but this does not scale up to larger documents, and the goal of OCR is usually to know what is actually written on the page... not what "should" have been written on the page. I think a larger number of pages makes it more likely for the LLM to hallucinate as it tries to "correct" errors that it sees, which is not the task. If that is a desirable task, I think it would be better to post-process the document with an LLM after it is converted to text, rather than asking the LLM to both read a large number of images and correct things at the same time, which is asking a lot.

Once the document gets long enough, current LLMs will get lazy and stop providing complete OCR for every page in their response.

One page at a time keeps the LLM focused on the task, and it's easy to parallelize so entire documents can be OCR'd quickly.

reply
I've been doing small PDFs, usually 5 or 6 pages in length.

I never tested Gemini 3's PDF OCR against individual images, but I can say it processes a small six-page PDF better than the retired Gemini 1.5 or 2 handled individual images.

I agree that OCR and analysis should be two separate steps.

reply
You could maybe then do a second pass on the whole text (as plain text, not OCR) to look for likely mistakes.
reply
This is not always easy. The models I tried were too helpful and rewrote too much instead of fixing simple typos. When I tried, I ended up with huge prompts and I still found sentences where the LLM was too enthusiastic. I ended up applying regexes for common typos and accepted some residual errors. It might be better now, though. But since then I've moved to all-in-one solutions like Mathpix and Mistral-OCR, which are quite good for my purpose.
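The regex approach might look like the sketch below. The substitution table is entirely illustrative; a real list would be built from the confusions you actually observe in your own corpus:

```python
import re

# Illustrative substitution table for recurring OCR confusions;
# build a real one from errors observed in your own documents.
OCR_FIXES = [
    (re.compile(r"\bteh\b"), "the"),
    (re.compile(r"(?<=\d)O"), "0"),  # letter O inside a number
    (re.compile(r"(?<=\d)l"), "1"),  # letter l inside a number
]

def apply_fixes(text):
    # Apply each pattern in order; later patterns see earlier fixes.
    for pattern, replacement in OCR_FIXES:
        text = pattern.sub(replacement, text)
    return text

print(apply_fixes("teh total was 1O4l units"))
```

Unlike an LLM second pass, this can't "enthusiastically" rewrite a sentence; it only ever touches the exact patterns you listed.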
reply
I'm keeping my eye on progress in this area as well. I need to free engineering design data from tens of thousands of PDF pages and make them easily and quickly accessible to LLMs.
reply
All of healthcare is crying. Trust me.
reply
I suppose tears of joy?
reply
Of sadness because they're not allowed to use it yet.
reply
If your needs are that sensitive, I doubt you'll find anything anytime soon that doesn't require a human in the loop. Even SOTA models only average around 95% accuracy on messy inputs. If that's per-character accuracy (which is how OCR is generally measured), that works out to 25+ errors on a page of 100+ words (roughly 500 characters). If you really can't afford mistakes, you have to treat the OCR as inaccurate. If you have key fields like "days to respond" and "units vacant", you need to detect their presence specifically, with a bias toward false positives over false negatives, and have a human confirm the OCR output against the source.
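To make the back-of-envelope arithmetic explicit (the 5 characters per word is an illustrative assumption, not a measured figure):

```python
def expected_errors(accuracy, words_per_page, chars_per_word=5):
    # Expected character errors per page at a given per-character
    # accuracy, assuming ~5 characters per word (illustrative).
    chars = words_per_page * chars_per_word
    return chars * (1 - accuracy)

print(expected_errors(0.95, 100))  # roughly 25 character errors per page
```

Even at 99% per-character accuracy the same page still averages about 5 errors, which is why "can't afford mistakes" implies human review.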
reply
> If you really can't afford mistakes you have to consider the OCR inaccurate.

Isn’t this close to the error rate of human transcription for messy input, though? I seem to remember a figure in that ballpark. I think if your use case is this sensitive, then any transcription is suspicious.

reply
This is precisely the real question. If you're exceeding human transcription accuracy, you may be in generally good shape. The question is what happens when you tell a human to be surgical about some part of the document; how does the comparison change then?
reply
I'm sure you've tried all this, but have you tried measuring inter-rater agreement via multiple attempts on the same LLM vs. different LLMs? Perhaps your system would work better if you ran it through 5 models 3 times and then highlighted the diffs for a human to choose from.
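A toy sketch of that diff-highlighting idea: majority-vote at the token level across several transcription attempts and flag any token the runs disagree on for human review. (This assumes the transcripts tokenize to the same length; real OCR output would need proper alignment first, e.g. with `difflib.SequenceMatcher`.)

```python
from collections import Counter

def consensus_with_flags(transcripts):
    # Token-level majority vote across several OCR attempts; tokens
    # without a strict majority are flagged for a human to resolve.
    # Assumes all transcripts split into the same number of tokens.
    merged = []
    for tokens in zip(*(t.split() for t in transcripts)):
        token, votes = Counter(tokens).most_common(1)[0]
        if votes > len(tokens) // 2:
            merged.append(token)
        else:
            merged.append("[?" + "|".join(sorted(set(tokens))) + "]")
    return " ".join(merged)

runs = ["42 units vacant", "42 units vacant", "42 unlts vacant"]
print(consensus_with_flags(runs))  # majority vote resolves the typo
```

The human then only has to look at the bracketed flags rather than re-reading every page.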
reply
Deciphering fax messages? What is this, the 90s?
reply
We have decades of internal reports on film that we’d like to make accessible and searchable. We don’t do it with new documents, but we have a huge backlog.
reply
Fax is still hard to hack, so some organizations have kept it alive for security.
reply
I think the most useful thing about faxes, security-wise, is that in their basic form they require zero digital storage of the image being sent. The only record on either side of the transmission is a piece of paper.*

Contrast that with email, which is store-and-forward by design, and now you have to put in effort to ensure both the sending and receiving email providers delete the message in a timely manner.

* obviously you can add store-and-forward behavior to either fax machine, but it's not the default.

reply