
Retrieval-Augmented Generation has become one of the most misused terms in AI. Vendors sell “RAG solutions.” Startups pitch “RAG platforms.” Enterprise buyers request “RAG capabilities” in RFPs. The problem is that RAG is not a product. It is a pattern – a way of combining retrieval and generation to produce grounded outputs from language models. And like any pattern, the value is entirely in the implementation.
We have built RAG systems for clients across industries – legal document search, customer support knowledge bases, internal tooling, compliance systems. The ones that work share a set of engineering fundamentals. The ones that fail almost always skip those fundamentals in favor of moving fast toward a demo.
This article is about the fundamentals. No framework pitches, no magic prompts. Just the engineering decisions that determine whether your RAG system is useful or a liability.
At its core, RAG is a three-step pattern:
Retrieve. Given a query, find the pieces of your corpus most likely to contain the answer.
Augment. Place the retrieved content into the model's context window alongside the query.
Generate. Have the language model produce an answer grounded in that supplied context.
That is it. Everything else – vector databases, embedding models, re-rankers, chunking strategies – is implementation detail. Important implementation detail, but detail nonetheless. When you lose sight of the pattern and focus only on the tooling, you end up optimizing the wrong things.
The goal of a RAG system is to get the right information into the context window at the right time. Every engineering decision you make should be evaluated against that goal.
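Before digging into the pieces, it helps to see the whole pattern in a few lines of code. This is a sketch, not a production pipeline; embed(), vector_db, and llm stand in for whichever embedding client, vector store, and model API you actually use.

def answer(query: str) -> str:
    # 1. Retrieve: find the chunks most relevant to the query.
    query_embedding = embed(query)                       # any embedding client
    chunks = vector_db.search(query_embedding, limit=5)  # any vector store

    # 2. Augment: put the retrieved chunks into the model's context.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # 3. Generate: let the LLM produce a grounded answer.
    return llm.generate(prompt)                          # any chat/completion API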
Chunking is the process of splitting your source documents into pieces that can be embedded and retrieved independently. It is the first major decision in a RAG pipeline and one of the most consequential.
The naive approach is fixed-size chunking: split every document into 500-token blocks. This is easy to implement and almost always produces mediocre results. The problem is that fixed-size chunks have no semantic coherence. A 500-token window might split a paragraph in half, separate a question from its answer, or cut a code block at the worst possible point.
Better approaches:
Semantic chunking. Split documents at natural boundaries: paragraphs, sections, headings. This preserves the semantic coherence of each chunk. The trade-off is uneven chunk sizes, which can affect retrieval consistency.
Hierarchical chunking. Create chunks at multiple granularity levels: document, section, paragraph. At retrieval time, you can match at the paragraph level and then expand to the section level for context. This gives you precision in matching with completeness in context. A minimal sketch of this approach appears after the chunking code below.
Overlapping chunks. Add overlap between adjacent chunks so that information at chunk boundaries is not lost. A 200-token overlap on 500-token chunks means that any content within 200 tokens of a boundary appears in two chunks. This is a simple technique that materially improves retrieval on boundary cases.
Entity-aware chunking. For structured domains (legal contracts, technical specifications, medical records), chunk along entity boundaries. A contract clause should be one chunk. A specification requirement should be one chunk. This requires domain-specific parsing but produces far better retrieval for domain-specific queries.
The right chunking strategy depends on your data and your queries. We typically start with semantic chunking plus overlap and iterate based on retrieval quality metrics. If you are not measuring retrieval quality – which specific chunks are being retrieved for which queries, and whether they contain the answer – you are flying blind.
def count_tokens(text: str) -> int:
    # Rough whitespace approximation; swap in your tokenizer (e.g. tiktoken) for real counts.
    return len(text.split())

def semantic_chunk(document: str, max_tokens: int = 500, overlap: int = 100) -> list[str]:
    # Split at paragraph boundaries, carrying roughly `overlap` tokens of trailing
    # context from each chunk into the next one.
    paragraphs = document.split("\n\n")
    chunks: list[str] = []
    current_chunk: list[str] = []
    current_length = 0
    for para in paragraphs:
        para_tokens = count_tokens(para)
        if current_length + para_tokens > max_tokens and current_chunk:
            chunks.append("\n\n".join(current_chunk))
            # Keep trailing paragraphs until we have about `overlap` tokens of context
            overlap_paras: list[str] = []
            overlap_length = 0
            for prev in reversed(current_chunk):
                if overlap_length >= overlap:
                    break
                overlap_paras.insert(0, prev)
                overlap_length += count_tokens(prev)
            current_chunk = overlap_paras
            current_length = overlap_length
        current_chunk.append(para)
        current_length += para_tokens
    if current_chunk:
        chunks.append("\n\n".join(current_chunk))
    return chunks
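The hierarchical strategy described above layers naturally on top of this. The sketch below is an assumption-heavy illustration: it presumes a simple document format with "#"-style section headings, and it emits paragraph-level records that remember their parent section so retrieval can match narrowly and return broadly.

def hierarchical_chunks(document: str) -> list[dict]:
    # Group paragraphs under their section headings, then emit paragraph-level
    # records that carry the full section text for expansion at retrieval time.
    sections: list[dict] = []
    for block in document.split("\n\n"):
        if block.startswith("#"):
            sections.append({"title": block.lstrip("# ").strip(), "paras": []})
        else:
            if not sections:                      # content before any heading
                sections.append({"title": "", "paras": []})
            sections[-1]["paras"].append(block)

    records = []
    for section in sections:
        section_text = "\n\n".join(section["paras"])
        for para in section["paras"]:
            records.append({
                "text": para,                     # embed and match at this level
                "section_title": section["title"],
                "section_text": section_text,     # hand this to the LLM at retrieval time
            })
    return records

At query time you embed and search over text but assemble the context from section_text: precision in matching, completeness in context.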
Your embedding model determines the quality of your semantic search. It maps text into a vector space where similar meanings are close together. Choosing the wrong embedding model is like building a search engine with a bad index – everything downstream suffers.
Key considerations:
Domain fit. General-purpose embedding models (OpenAI’s text-embedding-3-large, Cohere’s embed-v3) work well for general text. But if your domain has specialized vocabulary – medical, legal, financial – you may need a domain-adapted model or a model fine-tuned on your data. Test this empirically. Run your actual queries against your actual data with different embedding models and measure retrieval precision; a minimal harness for doing that is sketched below.
Dimensionality vs. performance. Higher-dimensional embeddings capture more nuance but cost more to store and search. For most applications, 1024 or 1536 dimensions (1536 being the default for OpenAI’s text-embedding-3-small) provide a good balance. Going to 3072 dimensions, the default for text-embedding-3-large, rarely improves retrieval enough to justify the cost unless your corpus is very large and semantically dense.
Multilingual requirements. If your data includes multiple languages, you need an embedding model that handles multilingual content well. This is not a feature you bolt on later – it is a selection criterion at the start.
Consistency. Once you choose an embedding model, changing it means re-embedding your entire corpus. This is operationally expensive for large datasets. Choose carefully and plan for the possibility that you will need to re-embed if a significantly better model becomes available.
We have found that for most business applications, starting with a high-quality general-purpose model and investing effort in chunking and retrieval tuning produces better results than starting with a specialized embedding model. The embedding model is important, but it is rarely the bottleneck.
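Measuring domain fit, as suggested above, does not take much machinery. The sketch below assumes a small hand-labeled set of queries mapped to the chunk IDs that answer them, plus embed_with() and search_index() wrappers around whichever embedding APIs and per-model indexes you are comparing; it reports hit rate at k for each candidate model.

def hit_rate_at_k(model_name: str, labeled_queries: dict[str, set[str]], k: int = 5) -> float:
    # labeled_queries maps each query to the chunk IDs that actually answer it.
    # embed_with() and search_index() are assumed wrappers around your embedding
    # API and a per-model index; swap in whatever clients you actually use.
    hits = 0
    for query, relevant_ids in labeled_queries.items():
        query_embedding = embed_with(model_name, query)
        results = search_index(model_name, query_embedding, limit=k)
        if relevant_ids & {r.chunk_id for r in results}:
            hits += 1
    return hits / len(labeled_queries)

# Run the same labeled set against each candidate model before committing to one:
# for model_name in ["general-purpose-model", "domain-adapted-model"]:
#     print(model_name, hit_rate_at_k(model_name, labeled_queries))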
Vector similarity search returns results ranked by cosine similarity (or whatever distance metric your vector database uses). This ranking is a reasonable first pass, but it is not good enough for production systems. The problem is that embedding similarity is not the same as relevance.
A chunk might be semantically similar to a query without actually answering it. “What is our refund policy?” and “Our refund policy was last updated in March” are semantically similar but the second chunk is not useful if the user wants to know the actual policy.
This is where re-ranking comes in. After your initial vector search returns the top-K candidates (typically 20-50), a re-ranker model evaluates each candidate against the original query and re-orders them by relevance. Re-rankers like Cohere Rerank or cross-encoder models are significantly more accurate than vector similarity alone because they consider the query and document together, not independently.
The pipeline looks like this: a fast vector search pulls a broad candidate pool (say, the 50 most similar chunks), a cross-encoder re-ranker scores each candidate against the query, and only the top few re-ranked chunks make it into the LLM’s context.
This two-stage retrieval consistently outperforms single-stage vector search in our benchmarks. The cost of the re-ranking step is minimal compared to the LLM call it feeds, and the quality improvement is substantial.
def retrieve_and_rerank(query: str, top_k: int = 5, candidate_pool: int = 50):
    # embed(), vector_db, and reranker are placeholders for whatever embedding
    # client, vector store, and re-ranking model you actually use.
    # Stage 1: fast vector similarity search over a broad candidate pool
    query_embedding = embed(query)
    candidates = vector_db.search(query_embedding, limit=candidate_pool)

    # Stage 2: re-rank with a cross-encoder that sees query and document together
    scored = reranker.rank(query=query, documents=[c.text for c in candidates])

    # Return only the top-k after re-ranking; these go into the LLM context
    reranked = sorted(scored, key=lambda x: x.score, reverse=True)[:top_k]
    return reranked
Here is the thing nobody wants to hear: the single biggest predictor of RAG system quality is the quality of the underlying data. Not the embedding model. Not the vector database. Not the LLM. The data.
If your source documents are poorly written, outdated, contradictory, or incomplete, no amount of retrieval engineering will produce good answers. Garbage in, garbage out applies to RAG with extra force because the LLM will confidently synthesize garbage into articulate, well-structured garbage.
Data quality for RAG means:
Currency. Documents need to be current. If your knowledge base has three versions of the same policy document and two are outdated, the system will sometimes retrieve the wrong one. Implement version control for your source documents. When a document is updated, the old version’s embeddings should be removed or marked as superseded.
Consistency. If different documents contradict each other, the LLM will pick one arbitrarily (or worse, blend them into a confident-sounding contradiction). Identify and resolve conflicts in your source material before embedding.
Completeness. If the answer to a common query is not in your corpus, the system will either hallucinate or return an unhelpful response. Map your expected query patterns to your corpus coverage. Gaps are not failures of your RAG system – they are gaps in your content.
Metadata. Rich metadata (document type, date, author, department, access level) enables filtering at retrieval time. “What is the current vacation policy?” should only retrieve the most recent version of the HR policy document, not every document that mentions vacation. Metadata filtering before vector search is more effective than trying to handle this in the prompt; a concrete example follows below.
We spend more time on data preparation than on any other part of RAG implementations. It is not glamorous work, but it is where the value is.
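To make the metadata point concrete, here is a minimal sketch of filtering before vector search. It reuses the illustrative embed() and vector_db placeholders from earlier; the filter syntax varies by vector store, so treat the shape, not the exact API, as the takeaway.

def retrieve_current_policy(query: str, department: str):
    # Narrow by metadata first, then rank the survivors by similarity.
    # The filter syntax is illustrative; adapt it to your vector store's API.
    query_embedding = embed(query)
    return vector_db.search(
        query_embedding,
        limit=20,
        filter={
            "doc_type": "policy",
            "department": department,
            "superseded": False,    # only the current version of each document
        },
    )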
After building RAG systems for multiple clients, we have collected a reliable set of anti-patterns – things that look reasonable but consistently produce poor results.
Anti-pattern: Stuffing the context window. Retrieving 20 chunks and cramming them all into the prompt does not improve accuracy. It degrades it. LLMs have a well-documented tendency to pay more attention to the beginning and end of long contexts and less to the middle. Include fewer, higher-quality chunks. Three excellent chunks outperform fifteen mediocre ones.
Anti-pattern: Ignoring retrieval failures. When the retrieval step returns low-relevance results, the system should say “I don’t have enough information to answer this” rather than generating an answer from weak context. Implement a relevance threshold below which the system declines to answer. This is better for users and better for trust.
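The threshold itself is a few lines of code. A minimal sketch, reusing retrieve_and_rerank from earlier; the cutoff value and the generate_answer() helper are placeholders to calibrate and wire up for your own system.

MIN_RELEVANCE = 0.4  # illustrative; calibrate against your re-ranker's score distribution

def answer_or_decline(query: str) -> str:
    results = retrieve_and_rerank(query, top_k=5)
    confident = [r for r in results if r.score >= MIN_RELEVANCE]
    if not confident:
        return "I don't have enough information to answer this."
    return generate_answer(query, confident)  # assumed LLM call with grounding instructions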
Anti-pattern: Static evaluation. Testing your RAG system once during development and then deploying it is not evaluation. Your data changes. Your users’ queries change. Models get updated. Implement continuous evaluation with a representative set of query-answer pairs that runs on a schedule and alerts when quality degrades.
Anti-pattern: Treating all documents equally. A casual Slack message and a legal compliance document should not have the same authority in your system. Implement source weighting that gives higher retrieval scores to authoritative sources. This can be done in metadata filtering, in the re-ranking step, or in the prompt itself.
Anti-pattern: Skipping the hybrid search. Pure vector search misses exact matches. If a user searches for “policy 4.2.1” and that exact string appears in a document, vector search might not surface it because the embedding model does not preserve exact string identity. Hybrid search – combining vector similarity with keyword matching (BM25) – catches these cases. Most production RAG systems need both.
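A common way to merge the two result lists is reciprocal rank fusion, which rewards chunks that rank well in either list without having to reconcile their very different score scales. A minimal sketch, assuming you already have ranked lists of chunk IDs from the vector and BM25 sides:

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    # Merge ranked lists of chunk IDs; k=60 is the conventional damping constant.
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# merged = reciprocal_rank_fusion([vector_result_ids, bm25_result_ids])
# The merged pool then goes through the re-ranking step as usual.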
After iterating through multiple implementations, the architecture we recommend for production RAG systems looks like this:
Ingestion pipeline. Documents flow through a processing pipeline that handles parsing, cleaning, chunking, metadata extraction, and embedding. This pipeline should be idempotent and re-runnable – when you change your chunking strategy, you can re-process the entire corpus.
Dual index. A vector index for semantic search and a keyword index for exact matching. Queries run against both indexes and results are merged before re-ranking.
Re-ranking layer. A cross-encoder re-ranker that evaluates candidates against the original query. This is the most impactful single addition to a naive RAG system.
Context assembly. The top re-ranked results are assembled into a context block with source attribution. Each chunk includes metadata (source document, date, relevance score) that the LLM can use to cite its sources.
Generation with grounding instructions. The LLM prompt includes explicit instructions to only answer based on the provided context, to cite sources, and to indicate when the context is insufficient. These instructions are part of the system prompt, not the user query. A minimal sketch of this step follows this list.
Evaluation pipeline. A continuous evaluation system that measures retrieval precision, answer accuracy, and hallucination rate against a maintained test set.
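To make the context assembly and grounding steps concrete, here is a minimal sketch. The llm client and the chunk fields (source_document, date, score, text) are placeholders for whatever your pipeline produces, and the exact prompt wording is something to iterate on against your own failure cases.

GROUNDING_PROMPT = (
    "Answer using only the sources provided below. "
    "Cite the source ID for every claim. "
    "If the sources do not contain the answer, say you do not have enough information."
)

def build_context(chunks) -> str:
    # Each block carries the metadata the LLM needs to attribute its answer.
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        blocks.append(
            f"[Source {i}] {chunk.source_document} ({chunk.date}, relevance {chunk.score:.2f})\n"
            f"{chunk.text}"
        )
    return "\n\n".join(blocks)

def generate_grounded_answer(query: str, chunks) -> str:
    context = build_context(chunks)
    return llm.generate(
        system=GROUNDING_PROMPT,
        user=f"Sources:\n{context}\n\nQuestion: {query}",
    )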
None of these components is exotic. There is no secret sauce. The value is in getting each step right and connecting them into a reliable pipeline. RAG is a pattern, and like any pattern, mastery comes from disciplined execution of the fundamentals.
RAG is not always the right approach. It works well when you have a large, changing corpus of documents and users ask questions that can be answered by retrieving the right subset. It works less well when the knowledge you need is small and stable enough to sit directly in the context window, when the goal is to change how the model behaves rather than what it can look up, or when answering requires multi-step actions and aggregation across the corpus rather than retrieval of a few relevant chunks.
Knowing when not to use RAG is as important as knowing how to implement it well. We cover the decision framework between RAG, fine-tuning, and agents in a separate article, because getting the architecture choice right is the first engineering decision that matters.
RAG is a powerful pattern. But it is a pattern, not a product. Treat it like engineering, not like purchasing, and you will get results worth deploying.