Retrieval Augmented Generation

Today we’ll discuss how to make our RAG (Retrieval Augmented Generation) applications more effective.
As we all know, LLMs are powerful tools on their own, but sometimes they fall short when it comes to providing reliable, up-to-date, or contextually relevant information.
This is where RAG comes in, acting as a kind of “knowledge engine” alongside the LLM.

These are tactics I also use in my own projects, so I wanted to share them here — both as a personal notebook and as something useful for you 😄.


Traditional RAG vs Agentic RAG

  • Traditional RAG: The user’s query is converted into an embedding, that embedding is used to search the vector database, and the most relevant chunks are passed to the model for response generation.
  • Agentic RAG: In this approach, the LLM doesn’t just receive context; it also gets access to tools. The model can then call these tools when needed to query the knowledge base.
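To make the difference concrete, here is a minimal Python sketch of the two flows. The embed, vector_search, and llm helpers are hypothetical stand-ins for your embedding model, vector database client, and LLM client, not any particular library’s API; the point is only the control flow.

    def embed(text: str) -> list[float]:
        # Stand-in for your embedding model.
        return [0.0]

    def vector_search(vector: list[float], top_k: int = 5) -> list[str]:
        # Stand-in for your vector database query; returns the top_k chunks.
        return ["<retrieved chunk>"] * top_k

    def llm(prompt: str, tools: list | None = None) -> str:
        # Stand-in for your LLM call; `tools` is only used in the agentic flow.
        return "<model answer>"

    def traditional_rag(query: str) -> str:
        # Retrieve once, up front, then generate with the retrieved context.
        context = "\n\n".join(vector_search(embed(query), top_k=5))
        return llm(f"Context:\n{context}\n\nQuestion: {query}")

    def agentic_rag(query: str) -> str:
        # Hand the model a retrieval tool; it decides if and when to call it,
        # possibly several times with reformulated queries.
        def search_knowledge_base(q: str) -> list[str]:
            return vector_search(embed(q), top_k=5)

        return llm(query, tools=[search_knowledge_base])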

Both methods are powerful. In production applications, however, as the scale grows to millions of embeddings, certain issues arise:

  • Query latency increases
  • Costs go up
  • The system becomes sluggish

To overcome these, the RAG pipeline should be optimized in three stages: Pre-Retrieval, Retrieval, and Post-Retrieval.


1. Pre-Retrieval Stage

The groundwork done before retrieval is critical for overall performance.

1.1 Data Indexing

Documents should be cleaned, split into smaller chunks, and indexed carefully.

  • Noisy or irrelevant data embedded into vectors increases cost and reduces retrieval quality.
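As a rough illustration of what this groundwork can look like, here is a small cleaning-and-chunking sketch. The regexes and the 500-character chunk size are arbitrary choices for the example; real pipelines usually split on sentence or token boundaries and clean according to the actual source format.

    import re

    def clean(text: str) -> str:
        # Remove noise before embedding: bare URLs and runs of whitespace
        # left over from PDF/HTML extraction.
        text = re.sub(r"https?://\S+", "", text)
        return re.sub(r"\s+", " ", text).strip()

    def chunk(text: str, size: int = 500) -> list[str]:
        # Naive fixed-size character chunking, just for illustration.
        return [text[i:i + size] for i in range(0, len(text), size)]

    raw_document = "  Quarterly   results ...   see https://example.com for details  "
    chunks = [c for c in chunk(clean(raw_document)) if c.strip()]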

1.2 Query Optimization

The user’s query should be aligned with the preprocessing steps used during embedding creation.

  • Example: If stopwords were removed before creating embeddings, the same should be done for the query.
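A minimal sketch of the idea: whatever normalization ran at index time is wrapped in one function and reused at query time. The stopword list here is an illustrative subset, not a real one.

    STOPWORDS = {"the", "a", "an", "of", "in", "what", "are"}  # illustrative subset

    def preprocess(text: str) -> str:
        # The exact same normalization is applied to documents before they
        # are embedded and to the incoming query before it is embedded.
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        return " ".join(tokens)

    # Index time and query time share the same function:
    doc_for_embedding = preprocess("The causes of climate change in Europe")
    query_for_embedding = preprocess("What are the causes of climate change?")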

1.3 Sliding Window

The overlap value between chunks must be tuned carefully.

  • Too little overlap → loss of information
  • Too much overlap → redundancy and cost

This is especially critical for academic, legal, or medical documents.
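Here is a small sliding-window chunker, assuming the text is already tokenized into a list of strings; the size and overlap values are illustrative and should be tuned per corpus.

    def sliding_window_chunks(tokens: list[str], size: int = 200, overlap: int = 50) -> list[list[str]]:
        # Each chunk shares `overlap` tokens with the previous one, so a
        # sentence that straddles a boundary still appears intact somewhere.
        step = size - overlap
        return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

    tokens = "a long legal clause split into many individual tokens".split()
    chunks = sliding_window_chunks(tokens, size=8, overlap=2)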

1.4 Metadata

Adding metadata (date, topic, document title, etc.) to embeddings allows efficient filtering during retrieval.

  • Example: “Return only healthcare reports after 2024.”
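In practice this just means storing a small metadata payload next to each embedding; the field names below are my own illustrative convention. Most vector databases (Qdrant, Pinecone, Weaviate, etc.) can evaluate such filters natively, before any vector comparison happens.

    # A record as it might be upserted into a vector DB: the embedding plus
    # a metadata payload that can be filtered on at query time.
    records = [
        {
            "id": "doc-001-chunk-03",
            "embedding": [0.12, -0.08, 0.33],   # toy vector
            "text": "…",                        # the chunk itself
            "metadata": {"topic": "healthcare", "year": 2025, "title": "Annual health report"},
        },
    ]

    # "Return only healthcare reports after 2024" becomes two simple predicates.
    wanted = [
        r for r in records
        if r["metadata"]["topic"] == "healthcare" and r["metadata"]["year"] > 2024
    ]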

1.5 Query Rewriting

Queries can be rewritten before going into the vector DB. This requires an additional LLM call, but results in more accurate retrieval.

  • Paraphrasing: “What causes climate change?” → “Causes of climate change”
  • Synonym Substitution: “What are the culprits of the climate catastrophe?” → “Causes of the climate crisis”
  • Sub-queries: Breaking a complex query into smaller, independent parts, then merging results
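A sketch of how this can be wired up, with the llm function as a hypothetical stand-in for the extra model call: the rewriter returns several variants (paraphrase, normalized synonyms, sub-queries), each variant is retrieved separately, and the results are merged afterwards.

    REWRITE_PROMPT = """Rewrite the user question for retrieval.
    Return, one per line: a short paraphrase, a synonym-normalized version,
    and up to three independent sub-queries."""

    def llm(prompt: str) -> str:
        # Hypothetical stand-in for a call to your LLM provider.
        return "causes of climate change\ncauses of the climate crisis\nwhat drives global warming"

    def rewrite_query(query: str) -> list[str]:
        # One extra LLM call; each variant is sent to the vector DB separately
        # and the retrieved chunks are merged and deduplicated afterwards.
        raw = llm(f"{REWRITE_PROMPT}\n\nQuestion: {query}")
        return [line.strip() for line in raw.splitlines() if line.strip()]

    variants = rewrite_query("What are the culprits of the climate catastrophe?")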

2. Retrieval Stage

This is where the actual search happens. Optimizations here directly impact speed and cost.

Instead of pure semantic search, combine keyword and semantic search (hybrid search).

  • Example: For the query “AI in Turkey”
    • Perform keyword search first to generate a smaller candidate set
    • Then apply semantic search within that smaller set
    • This avoids scanning millions of vectors unnecessarily
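A minimal pure-Python sketch of this two-stage idea, assuming each chunk is a dict with "text" and "vector" keys (my own convention for the example). The lexical pass here is a simple term-overlap filter; production systems typically use BM25 or the database’s built-in keyword index for that step.

    def keyword_filter(chunks: list[dict], query: str) -> list[dict]:
        # Cheap lexical pass: keep only chunks sharing at least one query term.
        terms = set(query.lower().split())
        return [c for c in chunks if terms & set(c["text"].lower().split())]

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    def hybrid_search(chunks: list[dict], query: str, query_vec: list[float], top_k: int = 5) -> list[dict]:
        # Semantic ranking runs only over the keyword-filtered candidates,
        # not over the full collection of vectors.
        candidates = keyword_filter(chunks, query) or chunks
        ranked = sorted(candidates, key=lambda c: cosine(c["vector"], query_vec), reverse=True)
        return ranked[:top_k]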

Metadata filters can drastically narrow down the search space.

  • Example: “Financial reports after 2022” → Apply a date filter before semantic search
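The same idea against a vector database looks roughly like the sketch below. The client.search call and its filter syntax are made up for illustration; in practice you would use your database’s own filtering API, which pushes the date condition down so the similarity ranking only touches the surviving records.

    def search_financial_reports(client, query_vec: list[float], top_k: int = 5):
        # Hypothetical client call: the structured filter is evaluated first,
        # then vector similarity is computed only on the survivors.
        return client.search(
            vector=query_vec,
            filter={"doc_type": "financial_report", "year": {"gt": 2022}},
            top_k=top_k,
        )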

3. Post-Retrieval Stage

Directly passing retrieved results to the LLM is not always optimal.

3.1 Prompt Compression

The retrieved chunks may be too long. In such cases, compress or summarize them.

  • Reduces token cost
  • Helps the model focus on the essential information
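One lightweight way to do this without another model call is crude extractive compression: keep only the sentences that overlap with the query. This is just a sketch; many setups instead use an LLM summarization call or a dedicated tool such as LLMLingua.

    import re

    def compress_chunk(chunk: str, query: str, max_sentences: int = 3) -> str:
        # Keep only sentences that share a term with the query, capped at
        # max_sentences; fall back to plain truncation if nothing matches.
        terms = set(query.lower().split())
        sentences = re.split(r"(?<=[.!?])\s+", chunk)
        relevant = [s for s in sentences if terms & set(s.lower().split())]
        return " ".join(relevant[:max_sentences]) or chunk[:300]

    retrieved = ["Climate change is driven by greenhouse gases. Unrelated footer text here."]
    context = "\n\n".join(compress_chunk(c, "causes of climate change") for c in retrieved)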

3.2 Re-Ranking

Not all retrieved results are of equal quality.

  • Use heuristic methods or a secondary model to re-rank results
  • Ensure the most relevant answer is placed at the top
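One common choice for the secondary model is a cross-encoder, which scores each (query, chunk) pair jointly: slower than the bi-encoder used for retrieval, but noticeably more precise over a small candidate set. The sketch below assumes the sentence-transformers package and a public MS MARCO cross-encoder checkpoint; any other re-ranking model or API would slot in the same way.

    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
        # Score every (query, chunk) pair and keep the highest-scoring chunks,
        # so the most relevant context ends up at the top of the prompt.
        scores = reranker.predict([(query, c) for c in chunks])
        ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
        return [c for c, _ in ranked[:top_k]]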

Conclusion

RAG is indispensable for up-to-date and accurate information retrieval in LLM-based systems. But as scale grows, the pipeline must be optimized to avoid sluggishness.

  • Pre-Retrieval: Cleaning, indexing, query rewriting
  • Retrieval: Hybrid search, filtering
  • Post-Retrieval: Compression, re-ranking

Applying these techniques improves user experience while keeping costs manageable in production environments.

I was also inspired by the LLM Engineer’s Handbook while preparing this post — highly recommended if you’re interested.


In the next post, I plan to dive deeper into the advantages of “Agentic RAG” and explore real-world use cases.