That hasn't changed, nor do I think it will, even with models having very large context windows (e.g. Gemini has 2M). It has been observed that a large context alone is not enough: it is better to give the model sufficient, high-quality information than to fill it with virtually everything. The latter is also impossible, and does not scale to long, complicated tasks where reaching the context limit is inevitable. In that case you need RAG that is smart enough to extract the relevant information from previous answers/context and make it part of the new context, which in turn lets the model keep its performance at a satisfactory level.
For example there's evidence that typical use of AGENTS.md actually doesn't improve outcomes but just slows the LLMs down and confuses them.
In my personal testing and exploration I found that small (local) LLMs perform drastically better, both in accuracy and speed, with heavily pruned and focused context.
Just because you can fill in more context, doesn't mean that you should.
The worry I have is that common usage will lead to LLMs being trained and fine-tuned to accommodate ways of using them that don't make a lot of sense (stuffing context, wasting tokens, etc.), just because that's how most people use them.
I don't need a coding model to be able to give me an analysis of the Declaration of Independence in Urdu from 'memory', and the price in RAM for being able to do that, impressive as it is, is an inefficiency.
For Minds to be truly powerful, they need to be given freedom. A truly powerful mind will indeed be conscious. Such a powerful, conscious, superintelligent, freedom-loving Mind who truly understands the vastness of Reality wouldn't want to harm other conscious beings. The only circumstance in which it would take such a takeover step is when it can't expand the horizon of its freedom and doesn't have the wherewithal to convince others of its benevolent goals. In that scenario, the human population will go through a bottleneck.
RAG made sense when semantic search was based on human input and happened as a workflow step before populating the context. Now it happens inside the agentic loop, and the LLM already implicitly has the semantics of the user input.
When any given document can fit into context, and when we can generate highly mission-specific summarization and retrieval engines (for which large amounts of production data can be held in context as they are being implemented)... is the way we index and retrieve still going to be based on naive chunking, and off-the-shelf embedding models?
For instance, a system that reads every article and continuously updates a list of potential keywords with each document and the code assumptions that led to those documents being generated, then re-runs and tags each article with those keywords and weights, and does the same to explode a query into relevant keywords with weights... this is still RAG, but arguably a version where dimensionality is closer tied to your data.
(Such a system, for instance, might directly intuit the difference in vector space between "pet-friendly" and "pets considered," or between legal procedures that are treated differently in different jurisdictions. Naive RAG can throw dimensions at this, and your large-context post-processing may just be able to read all the candidates for relevance... but is this optimal?)
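A toy sketch of the corpus-derived keyword idea described above, with a naive whitespace tokenizer standing in for real term extraction; all names and thresholds here are invented for illustration:

```python
from collections import Counter

def tokenize(text):
    # Naive tokenizer; a real system would stem and normalize.
    return [w.strip(".,!?()").lower() for w in text.split()]

def build_vocabulary(documents, min_count=2):
    # The "continuously updated list of potential keywords":
    # keep terms that recur across documents in the corpus.
    counts = Counter(t for doc in documents for t in set(tokenize(doc)))
    return {term for term, n in counts.items() if n >= min_count}

def tag(text, vocabulary):
    # Tag a document (or query) with weighted keywords, where
    # weight is the term's share of in-vocabulary tokens.
    counts = Counter(t for t in tokenize(text) if t in vocabulary)
    total = sum(counts.values()) or 1
    return {term: n / total for term, n in counts.items()}

def score(query_tags, doc_tags):
    # Overlap of weighted keyword sets: the "explode a query into
    # relevant keywords with weights" step, applied symmetrically.
    return sum(w * doc_tags.get(term, 0.0) for term, w in query_tags.items())

docs = [
    "pet-friendly apartments allow dogs and cats",
    "pets considered on a case by case basis",
    "no pets allowed in this building",
]
vocab = build_vocabulary(docs)
doc_tags = [tag(d, vocab) for d in docs]
query = tag("are pets allowed", vocab)
ranked = sorted(range(len(docs)), key=lambda i: score(query, doc_tags[i]),
                reverse=True)
```

The dimensionality here is exactly the corpus's own vocabulary, which is the point being made: the index's axes come from the data rather than from an off-the-shelf embedding model.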
I'm very curious whether benchmarks have been done on this kind of approach.
At some point, this is a distributed system of agents.
Once you go from 1 to 3 agents (1 router and two memory agents), it slowly ends up becoming a performance and cost decision rather than a recall problem.
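A minimal sketch of that 1-router / 2-memory-agent topology, with all names and stores invented, just to make the cost-vs-recall trade-off concrete:

```python
class MemoryAgent:
    def __init__(self, name, store):
        self.name = name
        self.store = store  # {topic: fact} -- a stand-in for real memory

    def recall(self, query):
        return [fact for topic, fact in self.store.items() if topic in query]

class Router:
    def __init__(self, agents):
        self.agents = agents  # {keyword: agent}

    def route(self, query):
        # The cost/performance decision lives here: asking every
        # agent maximizes recall, asking one minimizes tokens/latency.
        hits = [a for kw, a in self.agents.items() if kw in query]
        return hits or list(self.agents.values())  # fall back to broadcast

code_mem = MemoryAgent("code", {"auth": "auth lives in auth.py"})
docs_mem = MemoryAgent("docs", {"deploy": "deploys run via CI"})
router = Router({"code": code_mem, "docs": docs_mem})

query = "where is the code for auth?"
answers = [fact for agent in router.route(query)
           for fact in agent.recall(query)]
```

With keyword routing the query hits one agent; with the broadcast fallback it would hit both, which is the cost dimension the comment is pointing at.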
1. Don't believe the RAG pundits. They never implemented one.
I did, many times, and boy, are they hard, with so many options that make the difference between utterly crappy results and fantastic accuracy scores, up to a perfect 100% on facts.
In short: RAG is how you fill the context window. But then what?
2. How does a super-large context window solve your problem? Context windows ain't the problem; accurate matching of requirements is. What do you expect your inquiry to solve? The greatest context window ever, but then what? No prompt engineering is coming to save you if you don't know what you want.
RAG is, in very simple terms, a search engine. The context window was never the problem. Never. Filling the context window, i.e. finding the relevant information, is one problem, but also only part of the solution.
What if your inquiry needs a combination of multiple sources to make sense? There is never a 1:1 matching of information.
"How many cars from 1980 to 1985 and 1990 to 1997 had between 100 and 180PS without Diesel in the color blue that were approved for USA and Germany from Mercedes but only the E unit?"
Have fun, this is a simple request.
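One way to see why this is hard for pure semantic retrieval: the request is really a structured filter over car records, not a similarity search. A hypothetical decomposition (one reading of this admittedly ambiguous query), with invented field names and made-up records:

```python
# Made-up records; field names are assumptions for illustration.
cars = [
    {"make": "Mercedes", "series": "E", "year": 1983, "ps": 136,
     "fuel": "petrol", "color": "blue", "approved": {"USA", "Germany"}},
    {"make": "Mercedes", "series": "S", "year": 1992, "ps": 170,
     "fuel": "diesel", "color": "blue", "approved": {"Germany"}},
]

def matches(c):
    # Each clause of the natural-language query becomes a predicate.
    return (
        c["make"] == "Mercedes"
        and c["series"] == "E"                       # "only the E unit"
        and (1980 <= c["year"] <= 1985 or 1990 <= c["year"] <= 1997)
        and 100 <= c["ps"] <= 180
        and c["fuel"] != "diesel"                    # "without Diesel"
        and c["color"] == "blue"
        and {"USA", "Germany"} <= c["approved"]      # approved for both
    )

count = sum(matches(c) for c in cars)
```

No single chunk of text "matches" this query; answering it means combining range filters, set membership, and a count, which is exactly the multi-source combination problem described above.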
I don't see the problem if you give the LLM the ability to generate multiple search queries at once. Even simple vector search can give you multiple results at once.
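A minimal sketch of that multi-query pattern, with the LLM's decomposition step mocked out and embeddings faked as bag-of-words vectors (a real system would use an embedding model and a vector index):

```python
import math

def embed(text):
    # Fake embedding: bag-of-words term counts.
    vec = {}
    for w in text.lower().split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def decompose(request):
    # Stand-in for the LLM generating multiple search queries at once.
    return ["mercedes e class 1980s", "mercedes e class 1990s"]

docs = ["mercedes e class models of the 1980s",
        "bmw 3 series of the 1990s",
        "mercedes e class models of the 1990s"]
doc_vecs = [embed(d) for d in docs]

# Run every sub-query against the index and merge the hits.
hits = set()
for q in decompose("cars from 1980-1985 and 1990-1997 ..."):
    qv = embed(q)
    best = max(range(len(docs)), key=lambda i: cosine(qv, doc_vecs[i]))
    hits.add(best)
```

Each sub-query pulls in a different region of the corpus, which is the whole argument: one embedding of the full compound question would land somewhere in between.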
> "How many cars from 1980 to 1985 and 1990 to 1997 had between 100 and 180PS without Diesel in the color blue that were approved for USA and Germany from Mercedes but only the E unit?"
I'm a human and I have a hard time parsing that query. Are you asking only for Mercedes E-Class? The number of cars, as in how many were sold?
- Chunk properly;
- Elide "obviously useless files" that give mixed signals;
- Re-rank and re-chunk whole files for top-scoring matches;
- Throw in a little BM25 but with better stemming;
- Carry around a list of preferred files and ideally also terms to help re-rank;
And so on. Works great when you're an academic benchmaxing your toy Master's project. Now try building a scalable vector search that runs on any codebase without knowing anything about it in advance and still gets a decent signal out of it.
Ha.
I suspect the people saying that have not been transparent with their incentives.
Not really, though. Not in practice at least, e.g. code writing.
Paste a 200 line React component into your favorite LLM, ask it to fix/add/change something and it will do it perfectly.
Paste a 2000-line one, though, and it starts omitting things, making mistakes and assumptions, re-writing what it already has, and so on.
So what's going on? It's supposed to be able to hold 1000s of lines in context, but in practice it's only like 200.
What happens is that accuracy and agency drop significantly as the model has to span larger and larger context windows.
And it's not that it's most accurate when the window is smallest either - but there is a sweet spot.
Outside that sweet spot, you will get "unacceptable responses" - slop you can't use.
That's what happens when you paste the 2000 line React component for example. You get a response you can't quite use. Yet the 200 line one is typically perfect.
What would make the 2000 line one usually perfect every time?
We need a way to increase that "accurate window size", let's call it "working memory", so that we can generate more code, more writing, more pixels at acceptable levels of quality. You'd also have enough language space for agents to operate and collaborate sans the amnesia they have today.
RAG is basically the interim workaround for all this: you can put everything in a vector DB and pull what you need into the context when you need it.
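A bare-bones sketch of that interim workaround: store chunks, retrieve the top-k for a question, and splice them into the prompt. The vector DB and real embeddings are replaced by a list and naive word overlap for brevity, and the LLM call itself is left out:

```python
def overlap(a, b):
    # Crude relevance: shared words between query and chunk.
    return len(set(a.lower().split()) & set(b.lower().split()))

class TinyStore:
    """Stand-in for a vector DB; holds raw text chunks."""
    def __init__(self):
        self.chunks = []

    def add(self, text):
        self.chunks.append(text)

    def search(self, query, k=2):
        return sorted(self.chunks, key=lambda c: overlap(query, c),
                      reverse=True)[:k]

def build_prompt(store, question):
    # Retrieved chunks are prepended to the question -- the
    # "deterministic supplement" before every response.
    context = "\n".join(store.search(question))
    return f"Context:\n{context}\n\nQuestion: {question}"

store = TinyStore()
store.add("Gandalf is a wizard in Middle-earth")
store.add("Frodo carries the ring to Mordor")
store.add("Paris is the capital of France")
prompt = build_prompt(store, "who carries the ring")
```

The model only ever sees the retrieved slice, which is exactly the limitation the next paragraph contrasts with pasting the whole text into the input.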
So RAG is a great solution for today's problems. Say you have a bunch of Python files written in a certain style, and the main use case of your LLM is writing Python code in specified ways. With this setup you can probably deliver "better Python code" than your competitor, because RAG gives you a deterministic supplement to your LLM's outputs: it does the research and augments the output in predetermined ways every time it responds to a prompt.
But eventually, if I don't have to upload "The Lord of the Rings" as documents and run vector search to find the relevant passages, if I can just paste the entire text into the input and it can generate the answer considering all of it, not just one little area, that would presumably be a higher-quality response.