The reason I'm not falling back on OCR is because the general case is full of things, like math equations and inset graphics/diagrams, that can't be OCR'd. The only robust way to deal with those is to treat them as graphical atoms: "this bounding box can be moved around, but should not be split up into pieces".