In the definition of RAG discussed here, the workflow looks something like this (simplified for brevity): when you send your query to the server, it first normalises the words, then converts them to vectors, or embeddings, using an embedding model (there are also simpler statistical mechanisms for this, but today most people mean a purpose-built LLM). An embedding is essentially an array of numeric coordinates in a high-dimensional space, e.g. [1, 2.522, …, -0.119]. The server can now use that vector to search a database of arbitrary documents that have pre-generated embeddings of their own. Those are usually computed when the documents are inserted into the database, following the same process as your search query above, so every record in the database has its own discrete set of embeddings to be queried during searches.
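To make the indexing step concrete, here is a minimal Python sketch. The embed() function is a toy stand-in (it just hashes character trigrams into a small vector); a real system would call an embedding model here, and the database would be a proper vector store rather than a list:

```python
def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model: hash character
    # trigrams into a small fixed-size vector. Real embeddings
    # have hundreds or thousands of dimensions.
    dims = 8
    vec = [0.0] * dims
    for i in range(len(text) - 2):
        vec[hash(text[i : i + 3]) % dims] += 1.0
    return vec

# Each record stores its own pre-computed embedding at insert time,
# produced by the same process later applied to search queries.
database = [
    {"title": doc, "embedding": embed(doc)}
    for doc in ["fanfiction archive", "romance novels", "biography index"]
]
```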

The important part is that you no longer have to compare strings (like looking for occurrences of the word "fanfiction" in the title and content); instead you can perform mathematical operations comparing query embeddings to stored embeddings: 1 is closer to 3 than to 7, and in the same way, fanfiction is closer to romance than it is to biography. If you rank documents by that proximity and take the top 10 or so, you end up with the documents most similar to your query, and thus the most relevant.
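The ranking step can be sketched in a few lines of Python. Cosine similarity is one common proximity measure (real vector databases use approximate nearest-neighbour indexes instead of a full sort, but the idea is the same):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: 1.0 means identical
    # direction, 0.0 means unrelated (orthogonal).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec: list[float], records: list[tuple[str, list[float]]], k: int = 10):
    # records: (document, pre-computed embedding) pairs.
    # Rank by proximity to the query and keep the k most similar.
    ranked = sorted(
        records,
        key=lambda rec: cosine_similarity(query_vec, rec[1]),
        reverse=True,
    )
    return ranked[:k]
```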

That is the R in RAG. The A, as in Augmentation, happens when, before forwarding the query to an LLM, you also add all the results that came back from your vector database, with a prefix like "the following records may be relevant to answer the user's request". That brings us to G, as in Generation: the LLM now responds to the question aided by a limited set of relevant entries from the database, which should allow it to yield very relevant responses.
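The augmentation step is essentially string assembly. A minimal sketch (the exact prefix wording and layout vary by system; this just illustrates the shape of the prompt the LLM finally sees):

```python
def build_rag_prompt(user_query: str, retrieved_docs: list[str]) -> str:
    # Augmentation: prepend the retrieved records, under an
    # instructing prefix, to the original user query. The combined
    # string is what gets sent to the LLM for generation.
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return (
        "The following records may be relevant to answer the user's request:\n"
        f"{context}\n\n"
        f"User request: {user_query}"
    )
```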

I hope this helps :-)

I think the example you give is a little backwards — a RAG system searches for relevant content before sending anything to the LLM, and includes any content retrieved this way in the generative prompt. User query -> search -> results -> user query + search results passed in same context to LLM.