It was great when LLMs had 4,000 or 8,000 token context windows and the biggest challenge was efficiently figuring out the most likely chunks of text to feed into that window to answer a question.
These days LLMs all have 100,000+ token context windows, which means you don't have to be nearly as selective. They're also exceptionally good at running search tools - give them grep or rg or even `select * from t where body like ...` and they'll almost certainly be able to find the information they need after a few loops.
Vector embeddings give you fuzzy search, so "dog" also matches "puppy" - but a good LLM with a search tool will search for "dog" and then try a second search for "puppy" if the first one doesn't return the results it needs.
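That loop is simple enough to sketch. Here's a rough illustration in Python - `search` and `ask_model` are placeholder callables standing in for whatever search tool and model client you use, not any specific API:

```python
def search_until_found(question, search, ask_model, max_tries=5):
    """Let the model drive a keyword search tool, retrying with new terms.

    Placeholders: `search(term)` returns a list of matching lines,
    `ask_model(prompt)` returns the model's text reply.
    """
    term = ask_model(f"Pick one search term for this question: {question}")
    matches = []
    for _ in range(max_tries):
        matches.extend(search(term))        # grep/rg/`LIKE` under the hood
        reply = ask_model(
            f"Question: {question}\nMatches so far: {matches}\n"
            "If that's enough, answer starting with ANSWER:. "
            "Otherwise reply with a different term to try, e.g. a synonym."
        )
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        term = reply                        # e.g. try "puppy" after "dog"
    return ask_model(f"Give a best-effort answer to {question} from: {matches}")
```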
The best way to search, I think, is a coding agent with grep and file system access, because the agent can adapt and explore instead of trying to one-shot it.
I am building my own search tool based on the principle of LoD (level of detail): any large text input can be trimmed down to about 10KB with clever trimming - for example, cut the middle of a paragraph while keeping its start and end, or cut the middle of a large file. An agent can then zoom in and out of a large file: it skims the structure first, then drills into the relevant sections. I'm using it to analyze logs, repos, zip files, long PDFs, and coding agent sessions, which can run into megabytes. Depending on the content type we can apply different kinds of compression for code and tree-structured data, and there is also a "tall narrow cut" (like `cut -c -50` on a file).
The promise is that input of any size fits into 10KB "glances", and the model can find things more efficiently this way without loading the whole thing.
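Roughly, the core trim looks something like this (a simplified sketch only - the names and heuristics are mine, and the real tool does content-aware trimming per format; this version just keeps the ends of long lines and the head and tail of the file):

```python
def glance(text: str, budget: int = 10_000, line_width: int = 120) -> str:
    """Trim text of any size down to roughly `budget` bytes, keeping the
    starts and ends of long lines and the head and tail of the file.
    (Illustrative sketch, not the actual tool.)"""

    def narrow(line: str) -> str:
        # "Tall narrow cut": keep both ends of an over-long line.
        if len(line) <= line_width:
            return line
        half = (line_width - 5) // 2
        return line[:half] + " ... " + line[-half:]

    lines = [narrow(line) for line in text.splitlines()]
    trimmed = "\n".join(lines)
    if len(trimmed) <= budget:
        return trimmed

    # Too big: keep lines from the head and tail until the budget is spent.
    marker = "[... middle trimmed ...]"
    head, tail = [], []
    used = len(marker)
    i, j = 0, len(lines) - 1
    while i < j and used + len(lines[i]) + len(lines[j]) + 2 <= budget:
        head.append(lines[i]); used += len(lines[i]) + 1
        tail.append(lines[j]); used += len(lines[j]) + 1
        i += 1; j -= 1
    return "\n".join(head + [marker] + list(reversed(tail)))
```

The agent calls this repeatedly: first on the whole file to see the structure, then on whichever slice it wants to zoom into.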
I guess RAG is faster? But I'm realizing I'm outdated now.
A few hundred lines of text is nothing for current LLMs.
You can dump the entire contents of The Great Gatsby into any of the frontier LLMs and it’s only around 70K tokens. This is less than 1/3 of common context window sizes. That’s even true for models I run locally on modest hardware now.
The days of chunking everything into paragraphs or pages and building complex workflows to store embeddings, search, and rerank in a big complex pipeline are going away for many common use cases. Having LLMs use simpler tools like grep based on an array of similar search terms and then evaluating what comes up is faster in many cases and doesn’t require elaborate pipelines built around specific context lengths.
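For example, the "tool" can be as simple as shelling out to grep with a handful of related terms and handing the matches back to the model. A minimal sketch, assuming grep is on the PATH and the corpus is a plain directory of text files:

```python
import subprocess

def keyword_hits(terms: list[str], path: str, max_lines: int = 200) -> str:
    """Grep a directory for several related terms at once and cap the output."""
    pattern = "|".join(terms)                        # e.g. "dog|puppy|hound"
    result = subprocess.run(
        ["grep", "-r", "-i", "-n", "-E", pattern, path],
        capture_output=True, text=True,
    )
    # Hand back only the first chunk; the model can always run another search.
    return "\n".join(result.stdout.splitlines()[:max_lines])

# The model picks the terms, reads the matches, and decides what to try next.
print(keyword_hits(["dog", "puppy", "hound"], "./docs"))
```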
When I last tried this with some Gemini models, they couldn't reliably identify specific scenes in a 50K word novel unless I trimmed the context down to a few thousand words.
> Having LLMs use simpler tools like grep based on an array of similar search terms and then evaluating what comes up is faster in many cases
Sure, but then you're dependent on (you or the model) picking the right phrases to search for. With embeddings, you get much better search performance.
With current models it's very good.
Anthropic used a needle-in-haystack example with The Great Gatsby to demonstrate the performance of their large context windows all the way back in 2023: https://www.anthropic.com/news/100k-context-windows
Everything has become even better in the nearly 3 years since then.
> Sure, but then you're dependent on (you or the model) picking the right phrases to search for. With embeddings, you get much better search performance.
How are those embeddings generated?
You're dependent on the embedding model to generate embeddings the way you expect.
For me, Gemini 3 Pro fails at pretty straightforward semantic content lookups in PDFs longer than a hundred pages, for example.
Your original comment that I responded to said a "few hundred lines of text", not hundred-page PDFs.