
RAG: Building AI That Knows Your Data

From theory to production: Retrieval-Augmented Generation explained and deployed

The Problem

🤔

The Question

"I want GPT-4 to answer questions about my company's internal documentation. How do I teach it without spending millions on fine-tuning?"

The Answer

RAG (Retrieval-Augmented Generation): Give the LLM relevant documents at query time. No training needed. Deploy in days, not months.

The Key Insight

Instead of trying to stuff all your company knowledge INTO the model's parameters (expensive, slow, requires retraining), you store it in a searchable database and retrieve relevant pieces on-demand when users ask questions.

The RAG Pipeline

1. INDEX
Convert documents → embeddings → store in vector database
Example: 100 documents → 500 chunks → 500 embeddings

2. RETRIEVE
User query → embedding → find the most similar documents
Example: "return policy?" → search → top 5 relevant chunks

3. AUGMENT
Add retrieved documents to the LLM prompt as context
Example: prompt = system_msg + context + user_query

4. GENERATE
LLM reads the context and generates a grounded answer
Answer based on YOUR docs, not training data

Real Example

User asks:
"What's the company's remote work policy?"
System retrieves:
3 relevant chunks from HR handbook:
• Section 4.2: Remote Work Guidelines
• Section 4.3: Equipment Reimbursement
• Section 7.1: Communication Expectations
LLM generates:
"According to Section 4.2, employees can work remotely up to 3 days per week..." [with citations]

What You'll Learn

This chapter will give you complete, production-focused knowledge for building RAG systems, from first principles to deployment best practices.

01

Why RAG? The Business Case

Understanding when RAG is the right solution vs fine-tuning or long context

  • The hallucination problem: why LLMs "make things up"
  • Knowledge cutoff dates and outdated information
  • Domain-specific knowledge (legal, medical, internal docs)
  • Cost comparison: RAG vs fine-tuning vs long-context
  • When NOT to use RAG (and what to use instead)
Cost reality check: Fine-tuning GPT-4 on custom data: $500K+. Building RAG: $100-500/month.
02

Document Processing & Chunking

How to turn PDFs, websites, and documents into searchable pieces

  • Chunking strategies: Fixed-size, semantic, sentence-based
  • Chunk size trade-offs: 200 vs 500 vs 1000 tokens
  • Overlap between chunks (why a 50-token overlap matters)
  • Preserving metadata: page numbers, sections, dates, authors
  • Handling tables, images, code blocks
Document → Chunks
100-page PDF
→ Extract text (~50,000 words ≈ 65,000 tokens)
→ Split into 500-token chunks with 50-token overlap
→ Result: ~145 overlapping chunks, each with metadata
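
To make the chunking step concrete, here is a minimal fixed-size chunker with overlap, a sketch in plain Python: word-based splitting stands in for real tokenization, the 500/50 numbers mirror the sizes discussed above, and the input file name is just an example.

# Minimal fixed-size chunking with overlap (illustrative sketch, not production code)
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks; words stand in for tokens here."""
    words = text.split()
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), stride):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append({"text": " ".join(piece), "start_word": start})  # keep position as metadata
        if start + chunk_size >= len(words):
            break
    return chunks

handbook_text = open("company_handbook.txt").read()   # hypothetical pre-extracted text
print(len(chunk_text(handbook_text)), "chunks")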
03

Embeddings & Vector Databases

Converting text to vectors and storing them for fast retrieval

  • Creating embeddings (OpenAI, Cohere, open-source models)
  • Vector databases: Pinecone, Weaviate, ChromaDB, FAISS
  • Semantic search with cosine similarity (Chapter 4 callback!)
  • Approximate Nearest Neighbors (ANN) for speed
  • Indexing strategies: HNSW, IVF, Product Quantization
Math connection: Each chunk = 768-dimensional vector. 10K chunks = 10,000 × 768 matrix. Search = finding closest vectors using cosine similarity.
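
To make that math connection concrete, here is a small numpy sketch of the search step: chunk embeddings stacked into a matrix, one query vector, and cosine similarity computed as a dot product of normalized vectors (random numbers stand in for real embeddings).

import numpy as np

rng = np.random.default_rng(0)
chunk_vecs = rng.normal(size=(10_000, 768))   # 10K chunks, one 768-dim vector each
query_vec = rng.normal(size=768)              # the embedded user query

# Cosine similarity = dot product of L2-normalized vectors
chunk_unit = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
query_unit = query_vec / np.linalg.norm(query_vec)

scores = chunk_unit @ query_unit              # shape (10000,), one score per chunk
top5 = np.argsort(scores)[::-1][:5]           # indices of the 5 closest chunks
print(top5, scores[top5])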
04

Retrieval Strategies

Finding the RIGHT documents, not just similar ones

  • Naive retrieval: Simple similarity search
  • Hybrid search: Semantic + keyword (BM25)
  • Re-ranking: Two-stage retrieval for precision
  • Metadata filtering: Date ranges, authors, document types
  • Query transformation: Rewriting, expansion, decomposition
Basic: Query → Find 5 similar docs → Precision: 60%
Advanced: Query + metadata + re-rank → Precision: 85%
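
As one concrete way to combine keyword and semantic results, here is a sketch of reciprocal rank fusion (RRF), assuming you already have two ranked lists of document IDs, one from BM25 and one from vector search; the IDs below are made up for illustration.

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of doc IDs into a single fused ranking."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic_hits = ["doc7", "doc2", "doc9", "doc4"]   # from vector search
keyword_hits = ["doc2", "doc5", "doc7", "doc1"]    # from BM25
print(reciprocal_rank_fusion([semantic_hits, keyword_hits])[:5])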
05

Prompt Engineering for RAG

Crafting prompts that maximize LLM performance with retrieved context

  • Optimal prompt structure for RAG
  • Context ordering: most relevant first or last?
  • Token budget management (context limits)
  • Instructions: "Only answer from context" vs "Use context to help"
  • Citation generation: teaching LLMs to cite sources
System: You are a helpful assistant. Answer using ONLY the context provided below.
Context:
[Retrieved Document 1]
[Retrieved Document 2]
[Retrieved Document 3]
Question: {user_question}
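
A small helper that assembles the template above might look like the sketch below; the exact instruction wording and the ordering of the context chunks are design choices worth A/B testing, and the function name simply matches the pseudocode used later in this chapter.

def build_rag_prompt(question, retrieved_chunks):
    """Assemble a RAG prompt: instructions, then retrieved context, then the question."""
    context = "\n\n".join(
        f"[Retrieved Document {i + 1}]\n{chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "You are a helpful assistant. Answer using ONLY the context provided below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_rag_prompt("What's the remote work policy?",
                       ["Section 4.2: Employees may work remotely up to 3 days per week."]))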
06

Advanced RAG Architectures

Modern approaches that go beyond simple retrieve-and-generate

  • Agentic RAG: LLM decides what and when to retrieve
  • Multi-hop retrieval: Iterative search for complex queries
  • Self-RAG: Model reflects on the quality of retrieved passages
  • CRAG: Corrective RAG with web search fallback
  • Graph RAG: Knowledge graph + vector retrieval
Evolution: Naive RAG (retrieve once) → Advanced RAG (query rewriting) → Agentic RAG (iterative, self-reflective)
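
To give a feel for the agentic/multi-hop idea, here is a heavily simplified loop: after each retrieval the model is asked whether it can answer yet, and it either answers or proposes a follow-up query. The retrieve and llm arguments are placeholders for your own retriever and LLM call, and the control format is one arbitrary choice among many.

def agentic_rag(question, retrieve, llm, max_hops=3):
    """Iterative retrieval: keep searching until the LLM says the context is sufficient."""
    context = []                                    # retrieved text snippets accumulated so far
    context_text = ""
    query = question
    for _ in range(max_hops):
        context += retrieve(query, k=3)             # one "hop": fetch more evidence
        context_text = "\n".join(context)
        decision = llm(
            f"Question: {question}\nContext:\n{context_text}\n"
            "Reply 'ANSWER: <answer>' if the context is sufficient, "
            "or 'SEARCH: <new query>' if more information is needed."
        )
        if decision.startswith("ANSWER:"):
            return decision[len("ANSWER:"):].strip()
        query = decision.split(":", 1)[-1].strip()  # refine the query and retrieve again
    return llm(f"Question: {question}\nContext:\n{context_text}\nAnswer as best you can.")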
07

Evaluation & Metrics

How do you know if your RAG system actually works?

  • Retrieval metrics: Precision@k, Recall@k, MRR, NDCG
  • Faithfulness: Does answer use only retrieved context?
  • Answer quality: Accuracy, completeness, relevance
  • LLM-as-judge: Using GPT-4 to evaluate responses
  • Creating test sets and benchmarks
  • Retrieval Quality: Are we finding the right docs?
  • Faithfulness: No hallucinations?
  • Answer Quality: Is it actually helpful?
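
Retrieval metrics are straightforward to compute once you have a labeled test set of queries and their relevant document IDs; here is a minimal sketch with made-up IDs.

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are actually relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that show up in the top k."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant) if relevant else 0.0

retrieved = ["doc7", "doc2", "doc9", "doc4", "doc1"]   # system output, best first
relevant = {"doc2", "doc3", "doc7"}                    # ground-truth labels for this query
print(precision_at_k(retrieved, relevant, k=5))        # 2/5 = 0.4
print(recall_at_k(retrieved, relevant, k=5))           # 2/3 ≈ 0.67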
08

Production Deployment

Taking RAG from prototype to production at scale

  • Latency optimization: Caching, batching, async retrieval
  • Cost management: Embedding costs, vector DB, LLM calls
  • Scaling: 1M documents, 1000s concurrent users
  • Incremental updates: Adding new docs without re-indexing all
  • Monitoring: Tracking quality, failures, user satisfaction
Real numbers: 10K queries/day = $100-500/month. Latency: 1-3 seconds/query. Accuracy: 85-95% with good retrieval.
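
One of the simplest latency and cost wins is caching embeddings for repeated queries; here is a minimal in-process sketch. embed_query is a placeholder for your embedding API call, and production systems typically use Redis or a similar shared cache rather than a dict.

import hashlib

_embedding_cache = {}

def cached_embedding(text, embed_query):
    """Return a cached embedding when the exact same text has been embedded before."""
    key = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_query(text)   # the paid API call happens only once per key
    return _embedding_cache[key]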
09

Hands-On: Building Your First RAG

Step-by-step guide with actual code

  • Setting up a vector database (ChromaDB/Pinecone)
  • Loading and chunking documents
  • Creating embeddings with OpenAI/Cohere API
  • Implementing semantic search
  • Building the RAG prompt and calling the LLM
  • Adding re-ranking and metadata filtering
# Full RAG pipeline pseudocode

# 1. INDEX (offline, done once per document set)
docs = load_docs("company_handbook.pdf")
chunks = chunk_text(docs, size=500)              # ~500-token chunks with overlap
embeddings = create_embeddings(chunks)           # one vector per chunk
vector_db.insert(chunks, embeddings)

# 2. RETRIEVE (at query time)
query = "What's our vacation policy?"
query_emb = create_embedding(query)              # same embedding model used for indexing
results = vector_db.search(query_emb, k=5)       # top-5 most similar chunks

# 3. AUGMENT + GENERATE
prompt = build_rag_prompt(query, results)        # retrieved chunks become prompt context
answer = llm.generate(prompt)
10

Common Pitfalls & Best Practices

Learn from others' mistakes

  • ❌ Chunks too small (lose context) or too large (irrelevant info)
  • ❌ Ignoring metadata (dates, sources, document types)
  • ❌ Pure semantic search (missing keyword matches)
  • ❌ Not evaluating retrieval quality separately from LLM
  • ❌ Hardcoding prompts without A/B testing
  • ✅ Always test on edge cases and adversarial queries
  • ✅ Build feedback loops (users rate answer quality)
11

RAG vs Other Approaches

When to use RAG and when to use something else

  • RAG vs Fine-tuning: Cost, flexibility, data requirements
  • RAG vs Long context: When 10M tokens is better than retrieval
  • RAG vs Agents: Combining both for complex workflows
  • Hybrid approaches: RAG + fine-tuning together
Need to teach LLM new facts?
→ Frequently changing? Use RAG
→ Rarely changing + behavioral? Fine-tune
→ Can fit in context? Use long context

Seeing the Concepts in Action

Each chapter taught you powerful, standalone concepts. RAG is one application that shows how these ideas work together in production systems:

Chapter 1: Pattern Discovery
You learned how models learn from data using gradient descent. In RAG, these same optimization principles train embedding models and LLMs.
Chapter 2: Applications
You learned to apply ML to real problems. In RAG, this mindset guides you from prototype to production deployment.
Chapter 3: Classification
You learned to make categorical decisions. In RAG, classification powers re-ranking, relevance scoring, and document filtering.
Chapter 4: Vectors
You learned cosine similarity for measuring vector similarity. In RAG, this powers semantic search between queries and documents.
similarity = cos(query_vec, doc_vec)
Chapter 5: Matrices
You learned batch matrix operations. In RAG, this lets us compare one query against 10,000 documents efficiently.
(1 × 768) · (10000 × 768)ᵀ = scores
Chapter 6: How LLMs Work
You learned how Y = XW + b processes text. In RAG, this is how the LLM reads retrieved context and generates answers.
Chapter 7: Embeddings
You learned how text becomes vectors. In RAG, every document chunk and query uses these embeddings for search.
"return policy" → [0.23, -0.51, 0.89, ...]
Chapter 8: Non-Linearity
You learned why activation functions enable deep learning. In RAG, these same non-linearities power the neural networks that create embeddings.
Chapter 9: Attention
You learned how attention focuses on relevant information. In RAG, this determines how the LLM weighs different retrieved documents.
Chapter 10: Architectures
You learned about context windows and efficiency. In RAG, these constraints inform how many documents we can retrieve.

Real-World Impact

🏢

Enterprise Search

Companies use RAG to make internal documentation searchable and accessible. Employees get instant answers from company knowledge bases.

90% reduction in "where is this document?" questions
⚖️

Legal & Compliance

Law firms use RAG to search case law, contracts, and regulations. Answers with citations ensure accuracy and traceability.

5x faster legal research with verifiable sources
🏥

Healthcare

Medical professionals use RAG to query research papers, treatment guidelines, and patient histories while maintaining privacy.

HIPAA-compliant, no training data leakage
💬

Customer Support

Support teams use RAG to answer customer questions from product docs, FAQs, and past tickets—accurately and consistently.

70% of questions answered automatically

Coming Soon

This chapter will be your complete guide to building production-ready RAG systems. We'll cover everything from fundamental concepts to advanced techniques, with practical code examples, cost analyses, and deployment strategies.

By the end, you'll understand not just how RAG works, but why design decisions matter, when to use different approaches, and how to debug issues when they arise.

✨ Interactive examples
📊 Real cost calculations
💻 Complete code samples
🎯 Production best practices