Understanding RAG
Retrieval-Augmented Generation: the engine that surfaces your knowledge.
In short, instead of relying solely on the LLM's internal training data (which can be outdated and can lead to hallucinated answers), RAG feeds the model relevant information from your private knowledge base before it generates a response.
What is RAG?
Large language models are trained on a fixed dataset. Once training ends, their knowledge is frozen — they have no awareness of documents you wrote last week, internal reports, or anything outside what they were trained on. Retrieval-Augmented Generation (RAG) is a way to work around this.
Instead of baking knowledge into the model, you keep it in a searchable document store. When a question comes in, the system retrieves the most relevant passages from that store and hands them to the model alongside the question. The model then answers based on what it was just shown, rather than what it vaguely remembers from training.
The result is a system that can be updated by adding documents, stays grounded in source material, and is less prone to inventing answers.
How it works
Ingestion
Before any question can be answered, the documents need to be indexed. This happens once — or whenever the document collection changes.
Documents: The starting point is any set of files you want to make searchable: PDFs, Word documents, plain text, web pages. The system does not care about topic or format as long as text can be extracted.
Parse: Raw files are read and converted to plain text or Markdown. For PDFs this means extracting the text layer; for scanned documents it means running OCR. Images, headers, and formatting are stripped away, because what matters is the textual content.
Chunk: A full document is usually too long to work with directly. It gets split into shorter passages — typically a paragraph or a few hundred words each. Smaller chunks are more precise: when retrieved, they tend to contain a focused piece of information rather than a mix of unrelated content.
Embed: Each chunk is passed through an embedding model, which converts it into a list of numbers — a vector. This vector captures the semantic meaning of the text in a form the computer can compare. Two chunks that talk about the same subject will end up with similar vectors, even if they use different words.
Vector Store: The vectors and their corresponding text chunks are saved to a database. This is the index the system will search at query time. It persists on disk, so ingestion only needs to run again when documents are added or changed.
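The ingestion steps above can be sketched in a few lines. This is a minimal illustration, not how any particular product implements it: `chunk_text`, `toy_embed`, and `ingest` are hypothetical names, the store is a plain in-memory list, and `toy_embed` is a hashed bag-of-words stand-in that does not capture semantic meaning the way a real embedding model would.

```python
import hashlib
import math
import re


def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text on blank lines, then cap each chunk at max_words words."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):
        words = paragraph.split()
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks


def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Placeholder for an embedding model: hashed word counts, L2-normalised.

    ASSUMPTION: a real system would call an actual embedding model here.
    """
    vector = [0.0] * dims
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dims
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


def ingest(documents: dict[str, str]) -> list[dict]:
    """Build an in-memory 'vector store' with one record per chunk."""
    store = []
    for name, text in documents.items():
        for i, chunk in enumerate(chunk_text(text)):
            store.append(
                {"source": name, "chunk_id": i, "text": chunk, "vector": toy_embed(chunk)}
            )
    return store


# Usage: two paragraphs become two indexed chunks.
store = ingest({"a.txt": "First paragraph here.\n\nSecond paragraph here."})
```

Each record keeps the chunk text next to its vector, so a later search can return readable passages rather than bare numbers.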
Query
Once the index exists, this is what happens every time a user asks the RAG system a question.
Question: A user types a question in plain language. No special syntax is required. One thing is worth keeping in mind: more context generally helps. "What are the safety requirements for lithium batteries in cargo aircraft?" will return better results than "battery safety". Precise and focused beats broad and vague.
Embed: The question goes through the same embedding model used during ingestion. This produces a vector that represents what the question is asking about — in the same numerical space as the stored chunk vectors.
Retrieve: The question vector is compared against all stored chunk vectors. The chunks with the closest vectors, meaning the ones most semantically similar to the question, are returned. Typically these are the top five to ten.
Build Prompt: The retrieved chunks and the original question are assembled into a prompt. A system prompt is added to give the model instructions about tone, format, or scope. This is what actually gets sent to the language model.
LLM: The language model reads the prompt and generates an answer. It is important to understand that the model has never seen your documents apart from the passages included in the prompt: it was not trained on them and has no memory of previous queries. Its role here is reasoning and language, not knowledge storage.
Answer: The model's response is returned to the user. The answer is grounded in the retrieved passages — not in what the model was trained to believe.
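The retrieve and build-prompt steps above can be sketched as follows. This is an illustration under stated assumptions: cosine similarity is used as the closeness measure (a common choice, not confirmed by the text), the three-dimensional vectors are hand-written placeholders rather than real embeddings, and `retrieve` and `build_prompt` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question_vector: list[float], store: list[dict], top_k: int = 5) -> list[dict]:
    """Return the top_k records whose vectors are most similar to the question."""
    ranked = sorted(
        store,
        key=lambda record: cosine_similarity(question_vector, record["vector"]),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, chunks: list[dict], system_prompt: str) -> str:
    """Assemble system instructions, numbered context passages, and the question."""
    context = "\n\n".join(f"[{i}] {c['text']}" for i, c in enumerate(chunks, start=1))
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


# Usage with placeholder vectors (a real system would embed the text instead).
store = [
    {"text": "Passage about lithium battery cargo rules.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Passage about onboarding.", "vector": [0.0, 0.2, 0.9]},
    {"text": "Passage about cargo aircraft loading.", "vector": [0.7, 0.6, 0.1]},
]
question = "What are the safety requirements for lithium batteries in cargo aircraft?"
top = retrieve([1.0, 0.2, 0.0], store, top_k=2)
prompt = build_prompt(question, top, "Answer only from the context below.")
```

The resulting `prompt` string is what would be sent to the language model; the model call itself is left out because it depends on the provider.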
How to refine the RAG process
The core RAG loop described above works well, but it has weak points. Semantic search is good at finding conceptually related content, but it can miss documents that happen to use different vocabulary. A single query formulation may not surface everything relevant. And the top-K chunks by vector distance are not always the most useful ones — proximity in vector space is an approximation, not a guarantee. Lancy adds several optional stages that address these limitations.
Keyword search and result fusion
Semantic search finds chunks that are about the same thing as your question. Keyword search — implemented here as BM25 — finds chunks that contain the same words as your question. These two approaches fail in different ways: semantic search can miss an exact technical term, while keyword search cannot handle paraphrasing or synonyms. Running both in parallel and combining the results covers more ground than either alone.
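To make the keyword side concrete, here is a minimal sketch of BM25 scoring. It is not taken from any specific implementation: the tokenizer is a simple lowercase split, `k1=1.5` and `b=0.75` are commonly used defaults chosen here as assumptions, and the IDF uses the non-negative "+1 inside the log" variant.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, chunks: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per chunk for the given query."""
    tokenized = [tokenize(chunk) for chunk in chunks]
    n_chunks = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_chunks if n_chunks else 0.0
    # Number of chunks in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log((n_chunks - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = 1.0 - b + b * len(tokens) / avg_len
            tf = term_freq[term]
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Usage: only chunks sharing words with the query receive a non-zero score.
chunks = [
    "Reset the API token in settings.",
    "How to get started with onboarding.",
    "Rotate your credentials every ninety days.",
]
scores = bm25_scores("API token", chunks)
```

The third chunk is about the same topic but shares no words with the query, so it scores zero. That is the paraphrasing gap that semantic search is meant to cover.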
The combination is handled by Reciprocal Rank Fusion (RRF). Rather than merging raw similarity scores, which are on different scales and hard to compare, RRF looks only at rank positions. A chunk that appears high in both the semantic list and the keyword list ends up ranked higher in the merged result than a chunk that ranks well in only one. The output is a single re-ordered list that reflects agreement between both retrieval methods.
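A small sketch shows how rank-only fusion works. Each chunk receives 1 / (k + rank) from every list it appears in, and the contributions are summed. The constant `k = 60` is a widely used default and is an assumption here, as are the chunk ids.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one list using RRF."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Usage: hypothetical chunk ids from the two retrievers, best first.
semantic_ranking = ["c3", "c1", "c7"]
keyword_ranking = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks c1 and c3 appear in both lists and therefore outrank c9 and c7, which each appear in only one.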
Query expansion
A question as typed by a user is one specific phrasing of an information need. Relevant content in the documents might be written very differently. Query expansion asks a language model to rewrite the original question into several alternative formulations — different angles on the same topic, different vocabulary, different levels of specificity.
Each reformulation is then used as a separate retrieval query, and all results are pooled before ranking. This increases the chance of surfacing relevant chunks that a single phrasing would have missed. A question like "onboarding steps" might expand to "how do I get started", "initial setup procedure", and "first-time configuration" — each of which may match different parts of the documentation.
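The pooling step can be sketched as below. Everything here is illustrative: `rewrite` stands in for the language-model call that produces reformulations, `search` stands in for whatever retriever is in use, and keeping each chunk's best rank across queries is one simple pooling rule among several possible ones.

```python
from typing import Callable


def expanded_retrieve(
    question: str,
    rewrite: Callable[[str], list[str]],
    search: Callable[[str], list[str]],
    top_k: int = 5,
) -> list[str]:
    """Run the original question plus its reformulations and pool the results."""
    queries = [question] + list(rewrite(question))
    best_rank: dict[str, int] = {}
    for query in queries:
        for rank, chunk_id in enumerate(search(query)[:top_k], start=1):
            # Keep the best (lowest) rank any query gave this chunk.
            if chunk_id not in best_rank or rank < best_rank[chunk_id]:
                best_rank[chunk_id] = rank
    return sorted(best_rank, key=lambda chunk_id: best_rank[chunk_id])


# Usage with stubs in place of a real language model and retriever.
fake_index = {
    "onboarding steps": ["c1", "c2"],
    "how do I get started": ["c2", "c5"],
    "initial setup procedure": ["c8"],
    "first-time configuration": ["c9", "c1"],
}


def fake_rewrite(question: str) -> list[str]:
    return ["how do I get started", "initial setup procedure", "first-time configuration"]


def fake_search(query: str) -> list[str]:
    return fake_index.get(query, [])


pooled = expanded_retrieve("onboarding steps", fake_rewrite, fake_search)
```

Chunks c5, c8, and c9 are found only by the reformulations, which is the point of the technique. The pooled list can then be handed to a later ranking stage.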
HyDE: Hypothetical Document Embeddings
There is a structural mismatch between a short question and a long answer passage: in vector space, questions and their corresponding answers do not always sit close to each other, even when they are clearly related. HyDE (Hypothetical Document Embeddings) works around this by inverting the problem.
Instead of embedding the question directly, the system first asks the language model to write a hypothetical answer — a short passage that looks like it could have come from a document. That hypothetical text is then embedded and used as the retrieval query. The idea is that a fake answer and a real answer will be closer together in vector space than a question and an answer.
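In code, the change from plain retrieval is small: the text that gets embedded is the generated passage, not the question. The sketch below assumes three hypothetical callables (`generate` for the language model, `embed` for the embedding model, `search_by_vector` for the vector store) and replaces them with stubs so it can run on its own. The prompt wording is also an assumption.

```python
from typing import Callable


def hyde_retrieve(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], list[float]],
    search_by_vector: Callable[[list[float], int], list[str]],
    top_k: int = 5,
) -> list[str]:
    """Retrieve using the embedding of a hypothetical answer, not the question."""
    hypothetical = generate(f"Write a short passage that answers the question:\n{question}")
    query_vector = embed(hypothetical)
    return search_by_vector(query_vector, top_k)


# Usage with stubs; real model and store calls would replace these.
embedded_texts: list[str] = []


def fake_generate(prompt: str) -> str:
    return "New users create an account and then configure a workspace."


def fake_embed(text: str) -> list[float]:
    embedded_texts.append(text)  # record what was actually embedded
    return [float(len(text))]


def fake_search(vector: list[float], top_k: int) -> list[str]:
    return ["c4", "c2"][:top_k]


result = hyde_retrieve("onboarding steps", fake_generate, fake_embed, fake_search, top_k=1)
```

The hypothetical passage is used only as a search key. It never reaches the final prompt, so factual errors in it do not flow directly into the answer.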
Reranking
Vector similarity is a fast but coarse filter. The top-K chunks returned by retrieval are good candidates, but their ranking is based on a single number — distance in a high-dimensional space. Reranking replaces that ordering with a more careful judgment. A language model reads each candidate chunk alongside the original question and scores how relevant the chunk actually is to answering it.
This is slower than vector search, so it is only applied to a shortlist, not the full index. The best-scoring chunks from that shortlist then go into the prompt. The benefit is that chunks which looked related by vector distance but do not actually answer the question get filtered out, and chunks with precise relevant content move up even if they were not at the top of the initial retrieval.
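The shortlist-then-rescore pattern can be sketched as follows. The `score` callable stands in for the language-model relevance judgment. The stub used here simply counts shared words, which is an assumption made only so the example runs without a model, and the candidate passages are made up.

```python
import re
from typing import Callable


def rerank(
    question: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    keep: int = 3,
) -> list[str]:
    """Re-order a shortlist by a relevance score and keep the best few."""
    scored = [(score(question, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:keep]]


def overlap_score(question: str, chunk: str) -> float:
    """Stub scorer: number of distinct words shared by question and chunk."""
    question_words = set(re.findall(r"[a-z0-9]+", question.lower()))
    chunk_words = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    return float(len(question_words & chunk_words))


# Usage: the shortlist arrives in vector-search order and leaves in relevance order.
shortlist = [
    "Notes on batteries in general.",
    "Section 4 covers lithium batteries in cargo aircraft.",
    "Cargo aircraft seating charts.",
]
best = rerank("lithium batteries cargo aircraft", shortlist, overlap_score, keep=2)
```

Because `score` is called once per candidate, the cost grows with the shortlist size, which is why reranking is applied after retrieval and not across the full index.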
Individually, each of these stages is a targeted fix for a specific failure mode. Together, they close enough of the gaps that the system stops feeling like a search engine and starts feeling like something that genuinely understood what you were asking and has the knowledge of your internal documents at its fingertips.