upvote
LLMs, even the best ones, are still hit or miss wrt quality. Constantly improving, though.

I see more confusion from Opus 4.x about how to weight the different parts of a paper in terms of importance than I see hallucinations of flat out incorrect stuff. But these things still happen.

reply
surely, but it is a considerable concern? deflecting constructive feedback is probably not the best encouragement for others for a show HN?
reply
Hmmm, didn’t realize I was deflecting - just stating facts. But if I came across that way then criticism noted.

If I turned this into a paid app then more attention would be given to quality. There’s only so much an app that leverages LLMs can do, though. With enough trace data and user feedback I could imagine building out Evals from failure modes.

I can think of a few ways to provide a better UX. One is already built-in - there’s a “Recreate” button the original uploader can click if they don’t like the result.

Things could get pretty sophisticated after that, such as letting the user tweak the prompt, allowing for section-by-section re-dos, changing models, or even supporting manual edits.

From a commercial product perspective, it’s interesting to think about the cost/benefit of building around the current limits of LLMs vs building for an experience and betting the models will get better. The question is where to draw the line and where to devote cycles. Something worthy of its own thread.

reply