Building Reliable RAG: Chunking, Re-Ranking, and the Parts Nobody Talks About
RAG demos are everywhere; reliable RAG is rare. These are the parts that actually decide whether your retrieval works in production.
The naive RAG that fails in week two
Every RAG tutorial teaches the same stack. Split your documents at 500 tokens. Embed with text-embedding-3-small. Retrieve the top-3 chunks by cosine similarity. Stuff them into the prompt. Ship it.
It works beautifully on your eval set. You run 50 golden questions, get 80% accuracy, call it production-ready, and deploy.
Then real users show up.
They ask compound questions. They use abbreviations your documents don't contain. They ask about something mentioned once in a footnote on page 47. Accuracy craters, to around 40% if you're honest about it. You start getting Slack messages from the sales team about "the AI that makes stuff up."
The demo worked because your eval questions were written by someone who also wrote the documents. They use the same vocabulary. The chunks that matter are obvious. Real user queries have none of that. They're messy, ambiguous, and semantically distant from the source text in ways that cosine similarity over 500-token fixed chunks simply cannot bridge.
This post is about fixing that. Not with ML magic — with engineering decisions that the tutorials skip because they're harder to demo but easier to understand once you see the pattern.
Chunking strategies
The first thing to accept: fixed-size chunking is a compromise, not a default. Splitting every 500 tokens regardless of content structure guarantees that some chunks will cut a sentence in half, bury the most relevant clause at the end where it gets diluted, or lump three unrelated topics together. Your retrieval is only as good as your chunks.
Three approaches that actually matter:
Fixed-size with overlap is the baseline. If you use it, set overlap at 10-20% of your chunk size. For 512-token chunks, that's 50-100 tokens of repeated content between consecutive chunks. This prevents the situation where the one sentence that answers the query sits right at the boundary between two chunks and ends up in neither. Overlap is cheap; missing the answer is not.
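To make the overlap arithmetic concrete, here is a minimal sketch of a sliding window over a token list. The function name `chunk_with_overlap` and the synthetic tokens are illustrative assumptions; in a real pipeline the tokens would come from your embedding model's tokenizer.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Synthetic tokens stand in for real tokenizer output (assumption).
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_with_overlap(tokens, chunk_size=512, overlap=64)

# The tail of each chunk is repeated at the head of the next one.
print(len(chunks), chunks[0][-64:] == chunks[1][:64])
```

An overlap of 64 tokens is 12.5% of 512, inside the 10-20% range above. A sentence that straddles the boundary at token 512 now appears whole in the second chunk.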
Semantic chunking splits on meaning boundaries rather than token counts. You embed each sentence, watch for places where the embedding similarity drops sharply between consecutive sentences, and cut there. The resulting chunks are variable-length but semantically coherent. LlamaIndex and LangChain both have implementations. It's slower at ingestion time, but retrieval quality on prose-heavy documents — legal text, research reports, narrative content — is noticeably better.
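The idea can be sketched in a few lines. This is not how LlamaIndex or LangChain implement it: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the fixed `threshold` is an assumption (library implementations typically derive the cut point from the distribution of similarities instead).

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Hypothetical stand-in for an embedding model: a bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk where similarity to the previous sentence drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])  # sharp drop: treat as a topic boundary
        else:
            chunks[-1].append(sentence)
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "The refund policy covers all annual plans.",
    "Annual plans can get a refund within 30 days.",
    "Our API rate limit is 100 requests per minute.",
    "Requests over the rate limit return HTTP 429.",
]
chunks = semantic_chunks(sentences)
for chunk in chunks:
    print(chunk)
```

Swapping `toy_embed` for calls to a real embedding model keeps the same structure; only the vector type and the similarity function change.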
Structural chunking is the one that earns its keep fastest. If your documents have headings, use them. A chunk that starts at an H2, carries the paragraphs under it, and stops at the next H2 is almost always more useful than a chunk that stops at token 512, wherever that happens to land. For Markdown, HTML, DOCX, or anything else with structure, heading-aware splitting is the right default. The principle is simple: one chunk should contain one idea, not one chunk per N tokens.
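For Markdown, heading-aware splitting can be as small as the sketch below. It is a simplified illustration: `split_by_headings` is a hypothetical helper, and it assumes `#`-style headings with no `#` lines inside fenced code blocks. Oversized sections would still need a secondary split.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown_text):
    """Return one (heading, body) pair per heading section."""
    chunks = []
    heading, body = None, []

    def flush():
        # Emit the section collected so far, skipping empty preambles.
        if heading is not None or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


doc = """# Billing
Invoices are sent monthly.

## Refunds
Refunds are issued within 30 days.

## Cancellation
Cancel any time from the account page.
"""
sections = split_by_headings(doc)
for title, text in sections:
    print(f"{title}: {text}")
```

Keeping the heading alongside the body is a deliberate choice: prepending it to the chunk text before embedding gives the retriever the section's topic even when the body never names it.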
The test for whether your chunking is right: paste 20 random chunks into a doc and ask yourself if each one could stand alone and answer a specific question. If most of them couldn't, your chunk boundaries are wrong.
Embedding model and dimension trade-offs
text-embedding-3-small is fine for getting started. It is not the right answer for production if you care about retrieval quality.
The models worth knowing in 2026:
OpenAI text-embedding-3-large (3072 dimensions, or truncated to 1536) outperforms 3-small on most retrieval benchmarks by a meaningful margin. The cost difference is real but usually smaller than the engineering cost of debugging poor retrieval.
Voyage 3 (Voyage AI) consistently beats OpenAI on code-heavy and technical document retrieval. If your corpus is API documentation, engineering specs, or anything code-adjacent, Voyage 3 is worth a direct comparison before you lock in.
Cohere embed-v4 is designed for multilingual retrieval and handles cross-language queries better than either OpenAI or Voyage in most benchmarks. If your users query in Hindi but your documents are in English — or vice versa — this is the one to evaluate.
On dimensions: the 1536-dim default is not always right. Higher dimensions capture more semantic nuance but increase storage and query costs. For most use cases, 1024-1536 is the sweet spot. If you're using pgvector at scale, watch your index size carefully — the HNSW index memory footprint grows linearly with dimensions.
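A rough back-of-the-envelope check shows why dimensions matter for index size. This sketch counts only the raw vectors, assuming 4-byte float32 values and a hypothetical corpus of one million chunks. HNSW graph links and per-row overhead come on top, so treat the result as a lower bound.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Bytes needed for the vectors alone, assuming float32 (4 bytes per value)."""
    return num_vectors * dimensions * bytes_per_value


# One million chunks is an illustrative corpus size, not a figure from this post.
for dims in (1024, 1536, 3072):
    gib = raw_vector_bytes(1_000_000, dims) / 2**30
    print(f"{dims} dims: {gib:.2f} GiB for 1M vectors")
```

Doubling the dimension doubles this floor, which is the linear growth described above.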
Domain fine-tuning earns its keep when your corpus uses specialized vocabulary that general embedding models don't understand well. Medical terminology, legal jargon, a very specific product taxonomy. Fine-tuning on contrastive pairs — (query, relevant chunk) — can lift retrieval recall by 15-25% in these domains. The bar: you need at least a few thousand labeled pairs, and you need to be sure the problem is vocabulary, not chunking or retrieval depth. Most teams that think they need fine-tuning actually need re-ranking.
Re-ranking
This is the single highest-leverage step in the entire RAG pipeline, and most teams don't run it.
Here is the problem with cosine similarity retrieval: it finds chunks that are semantically similar to the query, not chunks that are likely to answer the query. Those overlap heavily but are not the same thing. A chunk that contains lots of the same vocabulary as your question will score well even if it doesn't actually answer it.
Re-ranking fixes this. The pattern:
- Retrieve 20-30 candidate chunks using your fast vector search.
- Run a cross-encoder re-ranker that looks at each (query, chunk) pair jointly and scores how well the chunk actually answers the question.
- Take the top 3-5 by re-rank score instead of cosine score.
The cross-encoder can attend to fine-grained relationships between query and chunk that a bi-encoder (the architecture behind most embedding models) cannot. It's slower — but you're running it on 30 chunks, not millions, so latency is usually under 200ms.
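```python
import cohere

co = cohere.Client(api_key="...")

# `user_query` and `candidates` come from the vector-search step:
# candidates are the top-30 chunk objects, each with a `.text` attribute.
results = co.rerank(
    model="rerank-v3.5",
    query=user_query,
    documents=[chunk.text for chunk in candidates],  # top-30 from vector search
    top_n=5,
)

# Map the re-ranked results back to the original chunk objects, best first.
reranked_chunks = [candidates[r.index] for r in results.results]
```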
Cohere Rerank (rerank-v3.5) and Voyage Rerank are the managed options most teams reach for. BGE-Reranker-v2-m3 is the open-source alternative if you need to run inference on your own infrastructure — it's competitive in quality and free to host.
In our internal tests across three different production RAG systems, adding re-ranking lifted answer quality from around 60% to 90% on a standardized eval set. The two-step retrieve-then-rerank pattern is simply better than trying to get your vector search to do both jobs at once.
Retrieval observability
You cannot improve what you cannot see. Most RAG systems ship with exactly zero visibility into what the retriever is doing, and teams spend weeks debugging by vibes.
Log these four things for every query, from day one:
The retrieved chunks and their scores — both cosine scores from vector search and re-rank scores if you're re-ranking. When an answer is wrong, you need to know whether the right chunk was retrieved (and the LLM ignored it) or never retrieved at all (a retrieval problem, not a generation problem). These fail completely differently.
What the model cited — if your LLM is instructed to cite sources, log which chunk indices it referenced in its response. Chunks that are retrieved but never cited are candidates for removal or re-chunking. Chunks that are cited for wrong answers tell you where your source content is misleading.
Query latency by stage — vector search latency, re-rank latency, LLM latency, separately. This is how you find the bottleneck before it becomes a user complaint.
User feedback — thumbs up/down, or any explicit correction. This is gold. A small feedback dataset lets you build an eval set that reflects real production queries rather than golden questions you wrote yourself.
Build this dashboard before you have a problem, not after. The information is cheap to collect at query time and very expensive to reconstruct retroactively.
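One possible shape for that per-query record is sketched below. The class name `RetrievalLog`, its field names, and the chunk IDs are all hypothetical. The point is only that the four items fit in one small structure that serializes to a log line.

```python
import json
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RetrievalLog:
    query: str
    chunk_ids: list = field(default_factory=list)
    cosine_scores: list = field(default_factory=list)
    rerank_scores: list = field(default_factory=list)
    cited_chunk_ids: list = field(default_factory=list)
    latency_ms: dict = field(default_factory=dict)
    feedback: Optional[str] = None  # "up", "down", or None if the user gave none

    @contextmanager
    def stage(self, name):
        """Time one pipeline stage and record it under `name`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.latency_ms[name] = (time.perf_counter() - start) * 1000

    def uncited(self):
        """Chunks that were retrieved but never cited in the answer."""
        return [c for c in self.chunk_ids if c not in self.cited_chunk_ids]

    def to_json(self):
        return json.dumps(asdict(self))


log = RetrievalLog(query="What is the refund window?")
with log.stage("vector_search"):
    log.chunk_ids = ["doc1#3", "doc2#1", "doc7#4"]  # placeholder retrieval result
    log.cosine_scores = [0.82, 0.79, 0.71]
with log.stage("rerank"):
    log.rerank_scores = [0.31, 0.94, 0.12]
log.cited_chunk_ids = ["doc2#1"]
log.feedback = "up"
print(log.to_json())
```

With records like this, the retrieved-but-ignored versus never-retrieved question becomes a lookup: check whether the chunk that holds the answer appears in `chunk_ids` at all.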
Refresh strategy
Source documents change. Your RAG system doesn't automatically know that. This is the part that bites teams hardest six months after launch.
Three approaches, with real trade-offs:
Full re-embed — delete everything, re-chunk, re-embed, re-index. Guarantees freshness. Costs scale linearly with corpus size. For a 10,000-document corpus embedded with text-embedding-3-large, a full refresh runs around $15-20 and 30-60 minutes. Fine as a nightly job for corpora that change slowly; painful for corpora that change continuously.
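The cost side of that estimate is simple arithmetic, sketched here so you can plug in your own numbers. Both inputs are assumptions rather than figures from this post: an average of 12,000 tokens per document and a price of $0.13 per million tokens. Check your provider's current pricing and your real token counts, and remember that chunk overlap inflates the token total.

```python
def embedding_cost_usd(num_docs, avg_tokens_per_doc, usd_per_million_tokens):
    """Estimated cost of embedding every document once."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Assumed inputs: 12,000 tokens per document, $0.13 per 1M tokens.
cost = embedding_cost_usd(10_000, 12_000, 0.13)
print(f"Full re-embed: ${cost:.2f}")
```

Because cost is the product of document count, document length, and price, a corpus of short support articles lands well below this figure and a corpus of long contracts well above it.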
Incremental re-embed — track which source documents changed (via hash, timestamp, or webhook), re-embed only those documents, update only those vectors. Requires engineering to maintain document-level provenance in your vector store and handle deletions cleanly. Pinecone's metadata filtering and pgvector's row-level updates both support this. The engineering investment is real but the cost savings compound.
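The hash-based variant of change tracking can be sketched without any vector store at all. `plan_refresh` is a hypothetical helper, and the in-memory dictionaries stand in for wherever you keep the hash recorded at the last embed.

```python
import hashlib


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(current_docs, stored_hashes):
    """Compare current documents against the hashes stored at the last embed.

    current_docs: {doc_id: text}; stored_hashes: {doc_id: hash}.
    Returns (doc_ids to re-embed, doc_ids whose vectors should be deleted).
    """
    to_embed = [
        doc_id
        for doc_id, text in current_docs.items()
        if stored_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in stored_hashes if doc_id not in current_docs]
    return to_embed, to_delete


stored = {
    "a": content_hash("old pricing"),
    "b": content_hash("faq"),
    "c": content_hash("retired"),
}
current = {"a": "new pricing", "b": "faq", "d": "brand new doc"}
to_embed, to_delete = plan_refresh(current, stored)
print(to_embed, to_delete)
```

For a changed document, delete its old vectors before inserting the new ones. The chunk count can differ between versions, so overwriting by position can leave stale chunks behind.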
Version-aware indexing — the right answer for regulated or frequently-updated content. Each document version gets its own embedding set, tagged with an effective date. Queries can be parameterized by "as of" date. This handles the case where a user asks a question about a policy that changed last month and you need to answer based on the version that was in effect when an event happened. Almost no team ships this on day one, but if your domain requires it (insurance, legal, compliance), plan for it early.
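The "as of" lookup itself is a small filter, sketched here over an in-memory list. `DocVersion`, `version_as_of`, and the sample policy text are hypothetical. In a real system the same condition would run as a metadata filter in the vector store before similarity search.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocVersion:
    doc_id: str
    effective_from: date
    text: str


def version_as_of(versions, doc_id, as_of):
    """Return the latest version of doc_id in effect on `as_of`, or None."""
    candidates = [
        v for v in versions
        if v.doc_id == doc_id and v.effective_from <= as_of
    ]
    return max(candidates, key=lambda v: v.effective_from, default=None)


versions = [
    DocVersion("refund-policy", date(2025, 1, 1), "Refunds within 30 days."),
    DocVersion("refund-policy", date(2026, 3, 1), "Refunds within 14 days."),
]

# An event on 2026-02-10 is governed by the version effective since 2025-01-01.
in_effect = version_as_of(versions, "refund-policy", date(2026, 2, 10))
print(in_effect.text)
```

Storing only an effective-from date keeps writes simple, since publishing a new version never requires updating older rows.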
The honest answer is that most teams ship without any refresh strategy, rely on manual re-embeds when something breaks badly enough that a user complains, and eventually build incremental refresh once the pain is obvious. Knowing the options before you launch at least lets you make that choice deliberately.
Reliable RAG is mostly engineering, not ML. The gap between a demo that impresses and a system that's still accurate six months after launch comes down to a handful of concrete decisions: how you chunk (structural beats fixed-size for structured documents), whether you re-rank (you should, always), what you log (chunks, scores, citations, feedback), and how you handle freshness (at minimum, a documented plan). None of this is academically interesting. All of it is what separates the systems that teams trust from the ones that get quietly turned off.
If you're earlier in the decision — still figuring out whether RAG is the right pattern for your product at all — RAG for SMBs is the place to start. It covers when retrieval-augmented generation earns its keep versus when a fine-tuned model or simple search is the better call. At Reveronix, the systems we build are designed to stay accurate in production, not just in demos — and these are the principles we apply across every engagement.
Written by the Reveronix team.