Summary: A hands-on guide to building a LangChain RAG pipeline from scratch, covering chunking strategies, hybrid retrieval, reranking, and vector database best practices.
01 Core Competencies
In the engineering implementation of RAG (Retrieval-Augmented Generation), we can't just stay at the "knowing the concepts" level. A qualified RAG engineer must possess the following capabilities:
- End-to-End Pipeline: Ability to hand-code the complete flow: Document Loading -> Chunking -> Embedding -> Vector DB Storage -> Retrieval -> Prompt Assembly -> Model Generation.
- Fine-Grained Chunking Strategies: Don't blindly split by character count. Master semantic chunking and Markdown header-based chunking.
- Multi-Path Recall & Reranking: Understand why single vector retrieval isn't sufficient and how to introduce Rerank for better precision.
- Hybrid Search: Combine Elasticsearch (BM25) keyword search with vector-based semantic search.
- Vector DB Hands-On: Proficiency in Chroma or Milvus CRUD operations and index configuration.
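The hybrid-search bullet above is usually implemented by fusing the BM25 ranking and the vector ranking into one list; Reciprocal Rank Fusion (RRF) is a common choice. A minimal pure-Python sketch (the doc ids and rankings are made up for illustration):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists (e.g., BM25 and vector search) into one.

    rankings: list of ranked doc-id lists, best first.
    k: damping constant; 60 is the value from the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1/(k + rank); docs found by both paths win.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["doc3", "doc1", "doc7"]  # keyword (BM25) ranking
vector_hits = ["doc1", "doc5", "doc3"]  # semantic (vector) ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Because RRF only needs ranks, not raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.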
02 Standard LangChain MVP Implementation
Chunking
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Strategy: recursive character splitting with context preservation
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # ~500 characters per chunk
    chunk_overlap=50,  # 50-char overlap to prevent sentence truncation
    separators=["\n\n", "\n", ".", "!", "?"]  # Prioritize paragraph breaks
)
splits = text_splitter.split_documents(docs)
Embedding & Vector DB Storage
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Embed each chunk via the OpenAI API into vectors like [0.1, -0.2, ...]
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db"  # Persistent storage
)
Retrieval
# Find the Top 3 most similar chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
question = "What is TSLA's net profit margin in 2025 Q4?"
retrieved_docs = retriever.invoke(question)
Prompt Assembly & Generation
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-3.5-turbo")
chain = prompt | llm

# Context assembled from the retrieved docs
response = chain.invoke({
    "question": question,
    "context": "\n".join(doc.page_content for doc in retrieved_docs)
})
print(response.content)
03 Advanced: What to Do When RAG Performance Is Poor?
This is the most common question in interviews and real-world projects. We typically optimize from three angles:
- Optimize Chunking Strategy
Pain Point
Rigidly splitting at 500 characters can easily cut "2025 revenue:" into one chunk and the actual number "10 billion" into the next. Context loss during retrieval leads to LLM hallucinations.
Solutions
- Semantic Chunking: Use embedding similarity between adjacent sentences. Keep coherent content together; only split when meaning shifts.
- Markdown Header Chunking: Split by headings like # Financial Summary, ## 1.1 Revenue. Retrieved content carries metadata like Financial Summary > Revenue, greatly improving retrieval precision.
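The header-based strategy can be sketched in plain Python to show how the metadata path is built (LangChain ships a MarkdownHeaderTextSplitter with the same idea; this simplified version handles `#` through `###`):

```python
import re

def split_by_markdown_headers(text):
    """Split markdown into chunks, each tagged with its header path
    (e.g., 'Financial Summary > Revenue') as retrieval metadata."""
    chunks, path, lines = [], {}, []

    def flush():
        if lines:
            chunks.append({
                "metadata": " > ".join(path.values()),
                "content": "\n".join(lines).strip(),
            })
            lines.clear()

    for line in text.splitlines():
        m = re.match(r"^(#{1,3})\s+(.*)", line)
        if m:
            flush()
            level = len(m.group(1))
            # A new header invalidates any deeper headers on the path
            path = {lvl: h for lvl, h in path.items() if lvl < level}
            path[level] = m.group(2)
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk now carries its section context, so a retrieved number like "10 billion" arrives labeled with the heading hierarchy it came from.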
- Multi-Path Recall & Reranking
- Coarse Ranking: Vector retrieval returns Top 50 relevant chunks (fast but moderate precision).
- Fine Ranking: Use a Cross-Encoder (e.g., bge-reranker) to score these 50 candidates precisely, selecting Top 5 for the LLM.
Result: Although it adds ~200ms latency, accuracy improves dramatically.
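The coarse-to-fine flow can be sketched independently of any model. Here `coarse_search` and `rerank_score` are placeholder callables standing in for a real bi-encoder retriever and a cross-encoder such as bge-reranker; the toy corpus and token-overlap scorer exist only to make the sketch runnable:

```python
def two_stage_retrieve(query, coarse_search, rerank_score, coarse_k=50, fine_k=5):
    """Coarse recall with a fast retriever, then precise reranking.

    coarse_search(query, k)    -> top-k candidate chunks (fast, approximate)
    rerank_score(query, chunk) -> relevance score (slow, accurate)
    """
    candidates = coarse_search(query, coarse_k)
    # Score every coarse candidate with the expensive model, keep the best few
    ranked = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return ranked[:fine_k]

# Dummy stand-ins: token overlap instead of a real cross-encoder
corpus = ["tesla revenue rose in q3", "apple sales declined",
          "tesla net profit margin in 2025", "weather report"]
coarse = lambda q, k: corpus[:k]
score = lambda q, c: len(set(q.lower().split()) & set(c.lower().split()))

top = two_stage_retrieve("tesla margin 2025", coarse, score, coarse_k=4, fine_k=1)
```

The structure is the point: the expensive scorer only ever sees `coarse_k` candidates, which is what keeps the added latency bounded.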
04 Vector Database (Chroma) Quick Reference
- Create/Load a Collection
import chromadb
client = chromadb.PersistentClient(path="./db")
collection = client.get_or_create_collection(name="finance_reports")
- Upsert (Update or Insert)
Important: Always use stable, unique ids. Upserting with the same id updates the record in place; generating new ids on every run makes duplicates accumulate.
collection.upsert(
    documents=["Apple Q3 revenue increased...", "Tesla sales declined..."],
    metadatas=[{"source": "report1.pdf"}, {"source": "report2.pdf"}],
    ids=["doc1", "doc2"]
)
- Query
results = collection.query(
    query_texts=["How are Tesla's sales?"],
    n_results=2
)
05 Deep Q&A: Engineering Pitfall Guide
Q1: How to Handle Tables in PDFs?
Loading tables directly with PyPDFLoader flattens rows and columns into garbled text, destroying the table's semantics.
Solution: Use pdfplumber to extract table structures, preserving row/column relationships before chunking.
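pdfplumber's `extract_tables()` returns each table as a list of rows (lists of cell strings). A common follow-up step is linearizing each row into "header: value" pairs before chunking, so the row/column relationships survive embedding; a small sketch of that step:

```python
def table_to_text(table):
    """Linearize an extracted table (first row = header) into one
    sentence per row, matching the list-of-rows shape that
    pdfplumber's extract_tables() produces."""
    header, *rows = table
    lines = []
    for row in rows:
        # Pair each cell with its column header; skip empty cells
        pairs = [f"{h}: {v}" for h, v in zip(header, row) if v]
        lines.append("; ".join(pairs))
    return "\n".join(lines)

table = [["Quarter", "Revenue", "Margin"],
         ["Q3", "10B", "12%"],
         ["Q4", "11B", "13%"]]
flat = table_to_text(table)
```

A row now reads "Quarter: Q3; Revenue: 10B; Margin: 12%", so even if chunking separates rows, each one keeps its meaning.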
Q2: How to Evaluate RAG Performance?
The core metric is recall@k: on a labeled evaluation set, does the Top-k retrieval contain the chunk with the correct answer? Only deploy a change when recall improves (e.g., from 60% to 80%).
Q3: How to Solve "Context Fragmentation"?
For example, "Moutai revenue" appears at the end of Chunk A, while the number appears at the beginning of Chunk B.
Configuration: Use chunk_overlap. With chunk_size = 500, set overlap to 50-100. Chunk B's beginning then repeats Chunk A's ending, so key information (subject + number) appears intact in at least one chunk.
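The overlap mechanism itself is easy to see in a bare character-level sketch (real splitters add separator awareness on top of this):

```python
def chunk_with_overlap(text, size=500, overlap=50):
    """Fixed-size chunking where each chunk starts `overlap` characters
    before the previous one ended, so a fact spanning a boundary
    (subject at the end of chunk A, number at the start of chunk B)
    survives intact in at least one chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

The last `overlap` characters of each chunk are repeated verbatim at the start of the next.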
Q4: The Pipeline Is So Long — How to Optimize Latency?
If users wait 10 seconds, the experience collapses.
Three-Layer Optimization:
- Experience Layer: End-to-end streaming output (Streaming/SSE).
- Architecture Layer: Run vector retrieval and BM25 in parallel; after Rerank, only send Top 3 to the LLM (reducing input tokens).
- Fallback Layer: Introduce Redis semantic caching. If a semantically similar question has been asked before, return the cached answer directly; latency drops to ~0.1s.
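The semantic-cache idea can be sketched without Redis: keep (embedding, answer) pairs and serve any new question whose embedding is close enough to a stored one. `embed` and the toy bag-of-words vocabulary below are placeholders; a production version would store vectors in Redis and use a real embedding model:

```python
import math

class SemanticCache:
    """Minimal in-memory semantic cache: a hit skips the whole RAG pipeline."""

    def __init__(self, embed, threshold=0.95):
        self.embed, self.threshold, self.entries = embed, threshold, []

    @staticmethod
    def _cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, question):
        v = self.embed(question)
        for vec, answer in self.entries:
            if self._cos(v, vec) >= self.threshold:
                return answer  # cache hit: return immediately
        return None            # cache miss: run the full pipeline, then put()

    def put(self, question, answer):
        self.entries.append((self.embed(question), answer))
```

The threshold is the key tuning knob: too low and users get answers to a different question; too high and the cache never hits.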