RAG in Practice: From a Hand-Written MVP to Production-Level Optimization

EEva·February 25, 2026·3 min read

Summary: A hands-on guide to building a LangChain RAG pipeline from scratch, covering chunking strategies, hybrid retrieval, reranking, and vector database best practices.

01 Core Competencies

In the engineering implementation of RAG (Retrieval-Augmented Generation), we can't just stay at the "knowing the concepts" level. A qualified RAG engineer must possess the following capabilities:

  • End-to-End Pipeline: Ability to hand-code the complete flow: Document Loading -> Chunking -> Embedding -> Vector DB Storage -> Retrieval -> Prompt Assembly -> Model Generation.
  • Fine-Grained Chunking Strategies: Don't blindly split by character count. Master semantic chunking and Markdown header-based chunking.
  • Multi-Path Recall & Reranking: Understand why single vector retrieval isn't sufficient and how to introduce Rerank for better precision.
  • Hybrid Search: Combine Elasticsearch (BM25) keyword search with vector-based semantic search.
  • Vector DB Hands-On: Proficiency in Chroma or Milvus CRUD operations and index configuration.
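Hybrid search ultimately needs a way to merge the keyword and vector result lists. Reciprocal Rank Fusion (RRF) is a common, model-free way to do this; below is a minimal sketch (the doc ids and hit lists are invented for illustration):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc ids (e.g. BM25 hits + vector hits).

    Each doc scores 1/(k + rank) per list it appears in; docs ranked well
    by several retrievers float to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]     # keyword (BM25) ranking
vector_hits = ["d1", "d4", "d3"]   # semantic (vector) ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # d1 appears high in both lists, so it wins
```

The constant `k=60` is the value used in the original RRF paper; it damps the advantage of rank-1 hits so a single retriever cannot dominate.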

02 Standard LangChain MVP Implementation

Chunking

# Strategy: recursive character splitting with context preservation
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # ~500 characters per chunk
    chunk_overlap=50,     # 50-char overlap to prevent sentence truncation
    separators=["\n\n", "\n", ".", "!", "?"],  # prioritize paragraph breaks
)
splits = text_splitter.split_documents(docs)

Embedding & Vector DB Storage

# Call the OpenAI API to convert each chunk into a vector like [0.1, -0.2, ...]
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db"  # persistent storage
)

Retrieval

# Find the Top 3 most similar chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
question = "What is TSLA's net profit margin in 2025 Q4?"
retrieved_docs = retriever.invoke(question)

Prompt Assembly & Generation

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-3.5-turbo")
chain = prompt | llm

# Context assembled from the retrieved docs
response = chain.invoke({
    "question": question,
    "context": "\n\n".join(doc.page_content for doc in retrieved_docs),
})

print(response.content)

03 Advanced: What to Do When RAG Performance Is Poor?

This is the most common question in interviews and real-world projects. We typically optimize from three angles:

  1. Optimize Chunking Strategy

Pain Point

Rigidly splitting at 500 characters can easily cut "2025 revenue:" into one chunk and the actual number "10 billion" into the next. Context loss during retrieval leads to LLM hallucinations.

Solutions

  • Semantic Chunking: Use embedding similarity between adjacent sentences. Keep coherent content together; only split when meaning shifts.
  • Markdown Header Chunking: Split by headings like # Financial Summary, ## 1.1 Revenue. Retrieved content carries metadata like Financial Summary > Revenue, greatly improving retrieval precision.
  2. Multi-Path Recall & Reranking
  • Coarse Ranking: Vector retrieval returns Top 50 relevant chunks (fast but moderate precision).
  • Fine Ranking: Use a Cross-Encoder (e.g., bge-reranker) to score these 50 candidates precisely, selecting Top 5 for the LLM.

Result: Although it adds ~200ms latency, accuracy improves dramatically.
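The coarse-then-fine flow can be sketched as a generic two-stage function. The scorers below are toy stand-ins (in practice, the coarse score is vector similarity and the fine score a cross-encoder such as bge-reranker):

```python
def two_stage_retrieve(query, chunks, coarse_score, fine_score,
                       coarse_k=50, final_k=5):
    # Stage 1 (coarse ranking): cheap score over everything, keep top coarse_k
    candidates = sorted(chunks, key=lambda c: coarse_score(query, c),
                        reverse=True)[:coarse_k]
    # Stage 2 (fine ranking): expensive scorer runs on the survivors only
    return sorted(candidates, key=lambda c: fine_score(query, c),
                  reverse=True)[:final_k]

# Toy scorer: word overlap between query and chunk
def overlap(q, c):
    return len(set(q.lower().split()) & set(c.lower().split()))

chunks = ["Tesla sales declined in Q4", "Apple revenue grew",
          "Tesla opened a factory"]
top = two_stage_retrieve("Tesla Q4 sales", chunks, overlap, overlap,
                         coarse_k=2, final_k=1)
print(top)  # ['Tesla sales declined in Q4']
```

The design point is that the expensive scorer only ever sees `coarse_k` candidates, which is why the added latency stays bounded (~200ms) regardless of corpus size.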

04 Vector Database (Chroma) Quick Reference

  1. Create/Load a Collection

import chromadb
client = chromadb.PersistentClient(path="./db")
collection = client.get_or_create_collection(name="finance_reports")

  2. Upsert (Update or Insert)

Important: use stable, unique ids. Upsert overwrites the existing entry with the same id, so re-running ingestion with stable ids updates records instead of piling up duplicates.

collection.upsert(
    documents=["Apple Q3 revenue increased...", "Tesla sales declined..."],
    metadatas=[{"source": "report1.pdf"}, {"source": "report2.pdf"}],
    ids=["doc1", "doc2"]
)

  3. Query

results = collection.query(
    query_texts=["How are Tesla's sales?"],
    n_results=2
)

05 Deep Q&A: Engineering Pitfall Guide

Q1: How to Handle Tables in PDFs?

Loading tables directly with PyPDFLoader flattens the cells into garbled text, destroying the row/column semantics.

Solution: Use pdfplumber to extract table structures, preserving row/column relationships before chunking.
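Once a table has been extracted as a list of rows (the shape pdfplumber's `extract_tables()` returns), it still has to be serialized so the structure survives embedding. One simple option, sketched below, is Markdown (the sample rows are invented):

```python
def table_to_markdown(rows):
    """Serialize a table (list of rows, first row = header) into Markdown
    so that row/column relationships survive chunking and embedding."""
    header, *body = rows
    lines = [
        "| " + " | ".join(str(c) for c in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in body]
    return "\n".join(lines)

rows = [["Metric", "Q4 2025"], ["Revenue", "10B"], ["Net margin", "12%"]]
print(table_to_markdown(rows))
```

Each row now reads as a complete statement ("Revenue | 10B"), so a retriever can match "revenue" and return the associated number in the same chunk.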

Q2: How to Evaluate RAG Performance?

The core metric is recall@k: whether the Top 3 retrieved chunks contain the correct answer. Only deploy a change when recall improves on a fixed evaluation set (e.g., from 60% to 80%).
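This check is easy to automate with a small labeled set of (question, gold chunk id) pairs; `retrieve` below stands in for whatever retriever is under test, and the rankings are invented:

```python
def recall_at_k(eval_set, retrieve, k=3):
    """Fraction of questions whose gold chunk appears in the top-k results."""
    hits = sum(1 for question, gold_id in eval_set
               if gold_id in retrieve(question)[:k])
    return hits / len(eval_set)

# Toy retriever: fixed rankings keyed by question
rankings = {
    "tesla margin?": ["doc7", "doc2", "doc9"],
    "apple revenue?": ["doc1", "doc5", "doc3"],
}
eval_set = [("tesla margin?", "doc2"), ("apple revenue?", "doc8")]
print(recall_at_k(eval_set, lambda q: rankings[q], k=3))  # doc2 hit, doc8 miss -> 0.5
```

Run the same evaluation set before and after every chunking or retrieval change, and let the number, not intuition, decide what ships.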

Q3: How to Solve "Context Fragmentation"?

For example, "Moutai revenue" appears at the end of Chunk A, while the number appears at the beginning of Chunk B.

Configuration: Use chunk_overlap. Set Chunk Size = 500, Overlap = 50~100. This way, Chunk B's beginning repeats Chunk A's ending, ensuring key information (subject + number) appears completely in at least one chunk.
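The effect is easy to verify with a toy sliding-window splitter (sizes scaled down from 500/50 so the example fits one line of text):

```python
def chunk_with_overlap(text, size, overlap):
    """Naive character-window chunking: each chunk starts (size - overlap)
    characters after the previous one, so chunk ends are repeated."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "Moutai 2025 revenue: 150 billion yuan total."

# Without overlap, the subject and the number land in different chunks
no_overlap = chunk_with_overlap(text, size=24, overlap=0)
# With overlap, at least one chunk contains the complete "revenue: 150 billion"
with_overlap = chunk_with_overlap(text, size=24, overlap=12)

print(any("revenue: 150 billion" in c for c in no_overlap))    # False
print(any("revenue: 150 billion" in c for c in with_overlap))  # True
```

The same principle holds at 500/50 scale: the overlap region is insurance that key subject+number pairs near a boundary appear intact in at least one chunk.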

Q4: The Pipeline Is So Long — How to Optimize Latency?

If users wait 10 seconds, the experience collapses.

Three-Layer Optimization:

  • Experience Layer: End-to-end streaming output (Streaming/SSE).
  • Architecture Layer: Run vector retrieval and BM25 in parallel; after Rerank, only send Top 3 to the LLM (reducing input tokens).
  • Fallback Layer: Introduce Redis-backed semantic caching. If a semantically similar question has been asked before, return the cached answer directly; latency drops to roughly 0.1s.
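A minimal in-process sketch of the semantic-cache idea (in production the entries would live in Redis and `embed` would be a real embedding model; here it is a toy bag-of-words vector over a tiny fixed vocabulary):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def bow_embed(text, vocab=("tsla", "margin", "apple", "revenue", "what", "is")):
    """Toy embedding: word counts over a fixed vocabulary."""
    words = text.lower().replace("?", "").split()
    return [words.count(w) for w in vocab]

class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self.embed, self.threshold, self.entries = embed, threshold, []

    def get(self, question):
        v = self.embed(question)
        for vec, answer in self.entries:
            if cosine(v, vec) >= self.threshold:  # semantically close enough
                return answer
        return None  # cache miss -> fall through to the full RAG pipeline

    def put(self, question, answer):
        self.entries.append((self.embed(question), answer))

cache = SemanticCache(bow_embed)
cache.put("What is TSLA margin?", "12%")
print(cache.get("What is TSLA margin?"))  # hit: returns the cached "12%"
print(cache.get("Apple revenue?"))        # miss: returns None
```

The threshold is the key tuning knob: set it too low and users get answers to the wrong question; too high and the cache never fires.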