From theory to production: Retrieval-Augmented Generation explained and deployed
"I want GPT-4 to answer questions about my company's internal documentation. How do I teach it without spending millions on fine-tuning?"
RAG (Retrieval-Augmented Generation): Give the LLM relevant documents at query time. No training needed. Deploy in days, not months.
Instead of trying to stuff all your company knowledge INTO the model's parameters (expensive, slow, requires retraining), you store it in a searchable database and retrieve relevant pieces on-demand when users ask questions.
How RAG works, in four steps:

1. Index: convert documents → embeddings → store in a vector database
   (100 documents → 500 chunks → 500 embeddings)
2. Retrieve: user query → embedding → find the most similar documents
   ("return policy?" → search → top 5 relevant chunks)
3. Augment: add the retrieved documents to the LLM prompt as context
   (prompt = system_msg + context + user_query)
4. Generate: the LLM reads the context and produces a grounded answer
   (an answer based on YOUR docs, not its training data)

This chapter provides complete, production-ready knowledge for building RAG systems, from first principles to deployment best practices.
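To make step 2 concrete before going further, here is a toy sketch of similarity search in plain Python with numpy. The vectors are made-up illustrative values standing in for real embeddings; a vector database performs the same cosine comparison, just with indexes that scale to millions of chunks.

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 4-dimensional "embeddings" (real ones have hundreds or thousands of dimensions)
chunk_vectors = {
    "Our return policy allows refunds within 30 days.": np.array([0.9, 0.1, 0.0, 0.2]),
    "The cafeteria serves lunch from 11am to 2pm.":     np.array([0.0, 0.8, 0.5, 0.1]),
}
query_vector = np.array([0.85, 0.15, 0.05, 0.25])  # pretend embedding of "return policy?"

# Rank chunks by similarity to the query and keep the best match
ranked = sorted(chunk_vectors.items(),
                key=lambda kv: cosine_similarity(query_vector, kv[1]),
                reverse=True)
print(ranked[0][0])  # prints the return-policy chunk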
Understanding when RAG is the right solution vs fine-tuning or long context
How to turn PDFs, websites, and documents into searchable pieces
100-page PDF
→ Extract text (50,000 words)
→ Split into 500-token chunks
→ Result: 250 chunks with metadata
Converting text to vectors and storing them for fast retrieval
Finding the RIGHT documents, not just similar ones
Crafting prompts that maximize LLM performance with retrieved context
Modern approaches that go beyond simple retrieve-and-generate
How do you know if your RAG system actually works?
Taking RAG from prototype to production at scale
Step-by-step guide with actual code
# Full RAG pipeline pseudocode
# 1. INDEX
docs = load_docs("company_handbook.pdf")
chunks = chunk_text(docs, size=500)
embeddings = create_embeddings(chunks)
vector_db.insert(chunks, embeddings)
# 2. QUERY
query = "What's our vacation policy?"
query_emb = create_embedding(query)
results = vector_db.search(query_emb, k=5)
# 3. AUGMENT + GENERATE
prompt = build_rag_prompt(query, results)
answer = llm.generate(prompt)
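The helper names in the pseudocode (chunk_text, create_embeddings, create_embedding) are placeholders. Below is one minimal way to fill them in, assuming tiktoken for token-accurate chunking and OpenAI's text-embedding-3-small model (the same model the tutorial later in this chapter uses); signatures are simplified to a single text string, so treat this as a sketch rather than the only correct implementation.

import tiktoken
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
encoding = tiktoken.get_encoding("cl100k_base")

def chunk_text(text, size=500, overlap=50):
    """Split text into ~size-token chunks, sharing `overlap` tokens between neighbors."""
    tokens = encoding.encode(text)
    return [encoding.decode(tokens[start:start + size])
            for start in range(0, len(tokens), size - overlap)]

def create_embeddings(chunks):
    """Embed a batch of chunks in one API call; returns one vector per chunk."""
    response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    return [item.embedding for item in response.data]

def create_embedding(text):
    """Embed a single string (used for the query at search time)."""
    return create_embeddings([text])[0]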
Turn any PDF into a chatbot that answers questions from its content
Your 100-page company handbook sits in a PDF. Employees ask the same questions repeatedly: "What's the vacation policy?", "How do I request equipment?", "What's the remote work rule?"
Instead of reading 100 pages, let's build a chatbot that answers from the PDF instantly. This tutorial uses ChromaDB (vector database), OpenAI embeddings, and GPT-4 to create a working RAG system.
Save this as pdf_chatbot.py and run it!
# pdf_chatbot.py - Turn a PDF into a chatbot in under 100 lines
import os
from pathlib import Path
import chromadb
from chromadb.utils import embedding_functions
from PyPDF2 import PdfReader
from openai import OpenAI
# Configuration
PDF_PATH = "company_handbook.pdf"
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=OPENAI_API_KEY)
# Step 1: Extract text from PDF
def extract_pdf_text(pdf_path):
"""Extract all text from PDF, page by page"""
reader = PdfReader(pdf_path)
pages = []
for i, page in enumerate(reader.pages):
text = page.extract_text()
pages.append({"text": text, "page": i + 1})
return pages
# Step 2: Chunk text with overlap
def chunk_text(pages, chunk_size=500, overlap=50):
"""Split text into overlapping chunks, preserving page metadata"""
chunks = []
for page in pages:
words = page["text"].split()
for i in range(0, len(words), chunk_size - overlap):
chunk = " ".join(words[i:i + chunk_size])
chunks.append({"text": chunk, "page": page["page"]})
return chunks
# Step 3: Create vector database and index chunks
def index_chunks(chunks):
"""Store chunks in ChromaDB with OpenAI embeddings"""
chroma_client = chromadb.Client()
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
api_key=OPENAI_API_KEY, model_name="text-embedding-3-small"
)
collection = chroma_client.create_collection(
name="pdf_chatbot", embedding_function=openai_ef
)
collection.add(
documents=[c["text"] for c in chunks],
metadatas=[{"page": c["page"]} for c in chunks],
ids=[f"chunk_{i}" for i in range(len(chunks))]
)
return collection
# Step 4: Retrieve relevant chunks for a query
def retrieve(query, collection, k=3):
"""Find top k most relevant chunks using semantic search"""
results = collection.query(query_texts=[query], n_results=k)
return results["documents"][0], results["metadatas"][0]
# Step 5: Generate answer using LLM with retrieved context
def generate_answer(query, context_chunks, metadata):
"""Use GPT-4 to answer question based on retrieved context"""
context = "\n\n".join([f"[Page {m['page']}]: {c}"
for c, m in zip(context_chunks, metadata)])
prompt = f"""You are a helpful assistant. Answer the question using ONLY the context below.
If the answer isn't in the context, say "I don't have that information."
Context:
{context}
Question: {query}
Answer:"""
response = client.chat.completions.create(
model="gpt-4", messages=[{"role": "user", "content": prompt}]
)
return response.choices[0].message.content
# Main pipeline
def main():
print("📄 Extracting PDF...")
pages = extract_pdf_text(PDF_PATH)
print("✂️ Chunking text...")
chunks = chunk_text(pages)
print(f" Created {len(chunks)} chunks")
print("🔍 Indexing in vector database...")
collection = index_chunks(chunks)
print("\n💬 Chatbot ready! Ask questions about the PDF.\n")
while True:
query = input("You: ")
if query.lower() in ["exit", "quit"]: break
docs, metadata = retrieve(query, collection)
answer = generate_answer(query, docs, metadata)
print(f"Bot: {answer}\n")
if __name__ == "__main__":
    main()

To run it:
1. Install the dependencies: pip install chromadb PyPDF2 openai
2. Set your API key: export OPENAI_API_KEY="sk-..."
3. Put company_handbook.pdf in the same directory
4. Start the chatbot: python pdf_chatbot.py

Example session:
You: What is the vacation policy?
Bot: According to page 12, employees receive...

Details worth noting:
- Embeddings use the text-embedding-3-small model (cost: ~$0.02 per 1M tokens)
- Retrieval returns the chunks for which cos(query_vec, chunk_vec) is highest
- Pass stream=True to the chat completion call for real-time responses

This short script demonstrates production RAG fundamentals: chunking strategy, embeddings, vector search, prompt engineering, and citation. Companies pay $100K+ for systems built on these exact principles. You just learned to build one in an afternoon.
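For the streaming tip above, here is one possible streaming variant of generate_answer; it reuses the client and prompt from the script, and the name generate_answer_streaming is just an illustration. Passing stream=True makes the OpenAI client yield partial chunks that you can print as they arrive.

def generate_answer_streaming(query, context_chunks, metadata):
    """Like generate_answer, but prints the answer token by token as it is generated."""
    context = "\n\n".join(f"[Page {m['page']}]: {c}"
                          for c, m in zip(context_chunks, metadata))
    prompt = f"""You are a helpful assistant. Answer the question using ONLY the context below.
If the answer isn't in the context, say "I don't have that information."

Context:
{context}

Question: {query}

Answer:"""
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # yield partial chunks instead of waiting for the full reply
    )
    answer = ""
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""  # delta.content is None in some chunks
        print(delta, end="", flush=True)
        answer += delta
    print()
    return answer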
Learn from others' mistakes
When to use RAG and when to use something else
Each chapter taught you powerful, standalone concepts. RAG is one application that shows how these ideas work together in production systems:
Companies use RAG to make internal documentation searchable and accessible. Employees get instant answers from company knowledge bases.
Law firms use RAG to search case law, contracts, and regulations. Answers with citations ensure accuracy and traceability.
Medical professionals use RAG to query research papers, treatment guidelines, and patient histories while maintaining privacy.
Support teams use RAG to answer customer questions from product docs, FAQs, and past tickets—accurately and consistently.
This chapter will be your complete guide to building production-ready RAG systems. We'll cover everything from fundamental concepts to advanced techniques, with practical code examples, cost analyses, and deployment strategies.
By the end, you'll understand not just how RAG works, but why design decisions matter, when to use different approaches, and how to debug issues when they arise.