
RAG: Retrieval-Augmented Generation

techniques · 2 min read

Retrieval-Augmented Generation (RAG) is a technique that enhances LLM responses by providing relevant external information at query time. Instead of relying solely on knowledge learned during training, RAG systems retrieve up-to-date, domain-specific documents and include them in the model's context.

The RAG pipeline has three main stages. First, documents are processed and split into chunks, then converted into numerical representations (embeddings) and stored in a vector database. Second, when a user asks a question, the query is also converted to an embedding and used to find the most similar document chunks. Third, the retrieved chunks are included in the prompt alongside the user's question, giving the model specific context to generate an accurate answer.
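The three stages can be sketched end to end in a few lines. This is a toy illustration, not a production implementation: the hash-based `embed` function stands in for a real embedding model, and a plain Python list stands in for a vector database.

```python
import string

def embed(text: str, dims: int = 64) -> list[float]:
    """Toy bag-of-words 'embedding': hash each word into a count vector.

    A stand-in for a real embedding model, used only to make the
    pipeline runnable end to end.
    """
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[hash(word.strip(string.punctuation)) % dims] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm if norm else 0.0

# Stage 1: chunk the documents and index their embeddings.
chunks = [
    "RAG retrieves documents at query time.",
    "Fine-tuning updates model weights offline.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Stage 2: embed the query and retrieve the most similar chunk.
query = "How does RAG use retrieved documents at query time?"
q_vec = embed(query)
best_chunk, _ = max(index, key=lambda item: cosine(q_vec, item[1]))

# Stage 3: augment the prompt with the retrieved context.
prompt = f"Context: {best_chunk}\n\nQuestion: {query}"
```

In a real system, stage 1 runs offline as an ingestion job, while stages 2 and 3 run on every user query.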

Popular vector databases for RAG include Pinecone, Weaviate, Chroma, Qdrant, and pgvector (a PostgreSQL extension). Each offers different trade-offs in terms of scalability, cost, and features. For smaller projects, in-memory solutions or SQLite-based stores can work well.
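For small collections, the in-memory option can be as simple as a linear scan over stored vectors. This sketch shows the core add/search interface that the databases above provide; the class and its behavior are illustrative, not any particular library's API.

```python
class InMemoryVectorStore:
    """Toy vector store: linear scan with cosine similarity.

    Real vector databases add approximate nearest-neighbour indexes,
    metadata filtering, and persistence on top of this same
    add/search idea.
    """

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, text: str, vector: list[float]) -> None:
        self._items.append((text, vector))

    def search(self, query_vector: list[float], k: int = 3) -> list[tuple[float, str]]:
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
            return dot / norm if norm else 0.0

        scored = [(cosine(query_vector, vec), text) for text, vec in self._items]
        scored.sort(reverse=True)
        return scored[:k]

# Usage with hand-picked 2-D vectors for illustration.
store = InMemoryVectorStore()
store.add("cats", [1.0, 0.0])
store.add("dogs", [0.8, 0.6])
store.add("stocks", [0.0, 1.0])
results = store.search([1.0, 0.1], k=2)
```

A linear scan is O(n) per query, which is fine up to tens of thousands of vectors; beyond that, approximate indexes (HNSW, IVF) become worthwhile.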

Embedding models convert text into dense numerical vectors that capture semantic meaning. Common choices include OpenAI's text-embedding-3-small, Cohere's embed models, and open-source options like BGE and E5. The quality of embeddings directly impacts retrieval accuracy.

Key challenges in RAG include chunking strategy (how to split documents effectively), retrieval quality (finding truly relevant passages), context window management (fitting retrieved content within token limits), and handling conflicting information from multiple sources.
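The chunking challenge is easiest to see in code. A minimal strategy is fixed-size windows with overlap, so that a sentence straddling a boundary still appears whole in at least one chunk; the sizes below are arbitrary for illustration.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows with overlap.

    Overlap reduces the chance that a passage relevant to a query is
    cut in half at a chunk boundary. Production systems often split on
    sentence or section boundaries instead of raw character counts.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks

# A 100-character string split into 40-character chunks with 10 overlap.
pieces = chunk_text("abcdefghij" * 10, size=40, overlap=10)
```

Each adjacent pair of chunks shares its last and first 10 characters, which is exactly the redundancy that protects boundary-spanning content.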

Advanced RAG techniques include hybrid search (combining vector similarity with keyword matching), re-ranking (using a second model to score retrieved results), query decomposition (breaking complex questions into sub-queries), and iterative retrieval (multiple rounds of search and generation).
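One common way to implement hybrid search is reciprocal rank fusion (RRF), which merges the ranked lists from vector and keyword retrieval without needing their raw scores to be comparable. The document IDs and `k = 60` constant below follow the usual RRF convention but are otherwise illustrative.

```python
def reciprocal_rank_fusion(rank_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists into one.

    Each document's fused score is the sum over lists of 1 / (k + rank),
    so documents ranked highly by several retrievers float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rank_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Vector search and keyword search disagree on the ordering;
# fusion rewards the document both retrievers rank well.
vector_ranking = ["d1", "d2", "d3"]
keyword_ranking = ["d2", "d3", "d1"]
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
```

Here `d2` wins because it ranks well in both lists (2nd and 1st), beating `d1`, which is 1st in one list but last in the other.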

RAG is particularly valuable for enterprise applications where accuracy matters, information changes frequently, and the model needs access to proprietary data. It provides a practical alternative to fine-tuning for many use cases, with the advantage of being easier to update and maintain.