Gemini 3 Pro seems to be built for handling multi-page PDFs.

I can feed it a multi-page PDF, tell it to convert it to markdown, and it does this well. I don't need to load the pages one at a time as long as I use the PDF format. (This was tested in AI Studio, but I think the API works the same way.)

reply
It's not that they can't do multiple pages... but did you compare against doing one page at a time?

How many pages did you try in a single request? 5? 50? 500?

I fully believe that 5 pages of input works just fine, but this does not scale up to larger documents, and the goal of OCR is usually to know what is actually written on the page... not what "should" have been written on the page. I think a larger number of pages makes it more likely for the LLM to hallucinate as it tries to "correct" errors that it sees, which is not the task. If that is a desirable task, I think it would be better to post-process the document with an LLM after it is converted to text, rather than asking the LLM to both read a large number of images and correct things at the same time, which is asking a lot.

Once the document gets long enough, current LLMs will get lazy and stop providing complete OCR for every page in their response.

One page at a time keeps the LLM focused on the task, and it's easy to parallelize so entire documents can be OCR'd quickly.
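
A minimal sketch of the one-page-per-request approach, assuming the pages are already rendered to image bytes. `ocr_page` is a hypothetical stand-in for whatever model call you use (not a real library function); here it is replaced by a fake so the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def ocr_document(
    pages: List[bytes],
    ocr_page: Callable[[bytes], str],
    max_workers: int = 8,
) -> List[str]:
    """OCR each page in its own request and return the texts in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order even though calls run concurrently.
        return list(pool.map(ocr_page, pages))


def fake_ocr_page(image: bytes) -> str:
    # Hypothetical placeholder for a real model call; it just decodes the bytes.
    return image.decode("utf-8")


pages = [b"page one", b"page two", b"page three"]
texts = ocr_document(pages, fake_ocr_page)
```

Threads are enough here because each call would spend its time waiting on the network; a failed page can be retried alone without redoing the rest of the document.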

reply
I've been doing small PDFs, usually 5 or 6 pages in length.

I never compared Gemini 3 PDF OCR against feeding it individual images, but I can say it processes a small 6-page PDF better than the retired Gemini 1.5 or 2 processed individual images.

I agree that OCR and analysis should be two separate steps.

reply
You could maybe then do a second pass on the whole text (as plain text, not OCR) to look for likely mistakes.
reply
This is not always easy. The models I tried were too helpful and rewrote too much instead of just fixing simple typos. I ended up with huge prompts and still found sentences where the LLM was too enthusiastic. In the end I applied regexes for common typos and accepted some residual errors. It might be better now, though. Since then I’ve moved to all-in-one solutions like Mathpix and Mistral-OCR, which are quite good for my purpose.
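
A rough sketch of the regex approach, for illustration only. The patterns in `TYPO_FIXES` are hypothetical examples, not the ones I actually used; a real table would be built from the errors you see in your own documents.

```python
import re

# Hypothetical examples of recurring OCR errors (assumed, not from a real corpus).
TYPO_FIXES = [
    (re.compile(r"\btbe\b"), "the"),    # 'h' misread as 'b'
    (re.compile(r"\bwitb\b"), "with"),  # same confusion at the end of a word
    (re.compile(r" {2,}"), " "),        # collapse runs of spaces
]


def fix_typos(text: str) -> str:
    """Apply each known fix in order; anything not listed is left untouched."""
    for pattern, replacement in TYPO_FIXES:
        text = pattern.sub(replacement, text)
    return text


cleaned = fix_typos("tbe cat sat witb  tbe dog")
```

Unlike an LLM pass, this can only change what the patterns match, so it never rewrites a sentence; the cost is that unlisted errors stay in the text.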