The surprisingly complex journey to text-selectable client-side generated PDFs

(sdocs.dev)

20 points

by FailMore, 1 day ago

7 comments

by gobdovan, 8 minutes ago

Thanks, this puts into perspective why copy-paste from PDFs is so bad.

I'm months into building a pasteboard transform library that normalises the provider-specific data from VS Code, Google Docs, PDFs and a bunch of Chromium apps, so I can start pasting everything everywhere exactly how I want it. It's much, much messier than I expected.

Apps put different UTTypes on the pasteboard that are not really compatible with each other. Usually there's a plain text fallback, then rich text/HTML, then provider-specific data. You show how much insane work is needed just to make text selectable with glyph mappings, layout, links, code blocks, rendered styles, etc. But once you copy from that PDF, most viewers still only expose raw text, and often broken raw text at that...
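To make the fallback chain above concrete, here is a minimal sketch of picking the richest representation available on a pasteboard. It assumes the pasteboard contents have already been read into a plain dict keyed by type identifier. `public.html`, `public.rtf` and `public.utf8-plain-text` are standard Apple uniform type identifiers, while `com.example.provider-data` and the function name `pick_representation` are hypothetical placeholders, not anything from the article or from a real app.

```python
from __future__ import annotations

from typing import Optional

# Preference order, richest first.
# "com.example.provider-data" is a made-up stand-in for an app-specific payload.
PREFERENCE = [
    "com.example.provider-data",
    "public.html",
    "public.rtf",
    "public.utf8-plain-text",
]


def pick_representation(
    items: dict[str, bytes],
    preference: list[str] = PREFERENCE,
) -> Optional[tuple[str, bytes]]:
    """Return (type_id, data) for the first preferred type that has data."""
    for type_id in preference:
        data = items.get(type_id)
        if data:  # skip types that are missing or empty
            return type_id, data
    return None


# A copy from a PDF viewer often carries only plain text,
# while an editor may also offer HTML.
pdf_copy = {"public.utf8-plain-text": b"Hello wor ld"}
editor_copy = {
    "public.utf8-plain-text": b"x = 1",
    "public.html": b"<pre>x = 1</pre>",
}
```

With these inputs, `pick_representation(editor_copy)` selects the HTML entry, while `pick_representation(pdf_copy)` has nothing better than the (possibly broken) plain text to fall back on.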

by FailMore, 2 minutes ago

Yep, it's a very interesting space for improvement imo. Broadly speaking, copy and paste is so central to working smoothly with a computer that it should probably have more power and quality built in (e.g. not having to install some random plug-in just to get clipboard history).

by ashishb, 9 minutes ago

Software engineers drastically underestimate GUI work: web layouts, mobile app layouts, and even PDF layouts are non-trivial to get right in all circumstances.

by FailMore, 3 minutes ago

Yep, they (can) rarely enter your domain... so it's easy to assume it's going to be trivial (maybe because things like .md or .txt files are trivial, so it's easy to think there's not much of a delta).

by josefrichter, 1 hour ago

It’s not that surprising. It’s one of those well-known Pandora’s boxes of web development: email templates, PDFs, printing, …

by FailMore, 57 minutes ago

Ah, I didn't know that. I hadn't worked on it before, and the file format is so prevalent that I assumed things would be easy, so it was surprising to me.

by SirHumphrey, 24 minutes ago

Nothing about PDF is easy. Similar to what Tom Scott once said about time zones, every time I have to deal with PDFs I pray that PDF.js can be hacked into doing it instead; otherwise I just don’t bother.

It’s one of the few cases where converting it into a picture and chucking it into a multimodal LLM is a more sensible solution than trying to parse it.
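For a sense of why parsing is painful: a PDF page generally stores positioned text runs rather than words or lines, so an extractor has to guess the reading order. The toy sketch below regroups runs into lines by baseline and inserts spaces where the horizontal gap is large. The `Run` type, the tolerance values and the function name are all assumptions made up for illustration; real extractors such as PDF.js deal with far more (fonts, rotation, columns, ligatures).

```python
from dataclasses import dataclass


@dataclass
class Run:
    x: float      # left edge of the run, in points
    y: float      # baseline; PDF y grows upward
    width: float  # advance width of the run
    text: str


def runs_to_lines(runs, y_tol=2.0, gap=1.5):
    """Group runs into lines by baseline, then join them left to right."""
    lines = []
    # Top of the page first (largest y), then left to right.
    for run in sorted(runs, key=lambda r: (-r.y, r.x)):
        if lines and abs(lines[-1][0].y - run.y) <= y_tol:
            lines[-1].append(run)   # same baseline, within tolerance
        else:
            lines.append([run])     # start a new line
    out = []
    for line in lines:
        line.sort(key=lambda r: r.x)
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            # A visible gap between runs is treated as a word break.
            if cur.x - (prev.x + prev.width) > gap:
                parts.append(" ")
            parts.append(cur.text)
        out.append("".join(parts))
    return out


# Runs arrive in arbitrary order and "Hello" is split across two runs.
sample = [
    Run(110, 700.5, 30, "world"),
    Run(72, 686, 25, "again"),
    Run(92, 700, 10, "lo"),
    Run(72, 700, 20, "Hel"),
]
print(runs_to_lines(sample))  # ['Hello world', 'again']
```

Pick the thresholds slightly wrong and words fuse together or split apart, which is the kind of broken text people see when copying out of a PDF.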