Similar to my other comment, we assume that LlamaParse and similar tools can provide the per-page OCR. But once you have that, integrating it into your workflows often adds complexity around combining results from different sources. Here is a deeper dive I wrote on the complexities of building extraction pipelines: https://www.parsewise.ai/doc-processing-pipelines
Mostly cross-doc reasoning at scale (e.g., 90k-page corpora) as opposed to doc-to-markdown conversions.