PDF Question Answering System
Brief
BookRAG is a production-ready Retrieval-Augmented Generation system for querying PDF documents using local AI models. The system runs entirely on-premises, ensuring complete data privacy without requiring any cloud services or API keys.
RAG systems combine a large language model with search over a specific document collection, enabling question answering grounded in the documents themselves. This is more reliable than querying an LLM on its own, because responses are based on the actual document text rather than the model's memory of its training data.
Implementation
Built a complete RAG pipeline with production-grade infrastructure:
- Designed a FastAPI backend with endpoints for document upload, processing status, and natural language querying (see the sketch after this list)
- Implemented document processing using LangChain to split PDFs into chunks and generate embeddings via Ollama
- Configured PostgreSQL with pgvector extension for efficient vector similarity search using cosine similarity
- Set up asynchronous processing with RabbitMQ workers to handle document ingestion in the background
- Added Redis caching layer to improve response times for repeated queries
- Containerized the entire stack using Docker Compose for easy deployment
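To make the API shape concrete, here is a minimal FastAPI sketch of the three endpoint types described above; the paths, the in-memory status map, and the stub functions are illustrative placeholders rather than the project's actual code.

```python
# Hypothetical sketch of the BookRAG API surface; paths and helpers are illustrative.
from uuid import uuid4

from fastapi import FastAPI, UploadFile
from pydantic import BaseModel

app = FastAPI(title="BookRAG")

# In-memory status map stands in for a real persistence layer.
STATUS: dict[str, str] = {}


class QueryRequest(BaseModel):
    document_id: str
    question: str


def enqueue_ingestion(document_id: str, pdf_bytes: bytes) -> None:
    """Placeholder for publishing an ingestion job to RabbitMQ."""
    STATUS[document_id] = "processing"


@app.post("/documents")
async def upload_document(file: UploadFile):
    """Accept a PDF upload and queue it for background processing."""
    document_id = str(uuid4())
    STATUS[document_id] = "queued"
    enqueue_ingestion(document_id, await file.read())
    return {"document_id": document_id, "status": STATUS[document_id]}


@app.get("/documents/{document_id}/status")
async def processing_status(document_id: str):
    """Report processing status for a previously uploaded document."""
    return {"document_id": document_id, "status": STATUS.get(document_id, "unknown")}


@app.post("/query")
async def query_documents(request: QueryRequest):
    """Stub endpoint; the real handler runs the retrieval and generation pipeline."""
    return {"document_id": request.document_id, "answer": None}
```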
Technical Details
The system follows a microservices architecture with clear separation of concerns:
Document Processing Pipeline
- Upload: PDFs are uploaded via an API endpoint and queued for processing
- Extract: Background workers extract text and split it into manageable chunks
- Embed: Each chunk is converted to a vector embedding using the nomic-embed-text model served by Ollama
- Store: Embeddings are stored in PostgreSQL with pgvector for similarity search
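The extract, embed, and store steps can be condensed into a short sketch, assuming the langchain-community, langchain-ollama, and psycopg packages; the chunks table, its columns, and the chunk sizes are hypothetical choices rather than the project's exact configuration.

```python
# Illustrative ingestion sketch; the chunks table and chunk sizes are assumptions.
import psycopg
from langchain_community.document_loaders import PyPDFLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


def ingest_pdf(path: str, conn: psycopg.Connection) -> None:
    # Extract: load page text from the PDF.
    pages = PyPDFLoader(path).load()

    # Split: recursive splitting with overlap preserves context across chunk boundaries.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(pages)

    # Embed: one vector per chunk via the nomic-embed-text model served by Ollama.
    embedder = OllamaEmbeddings(model="nomic-embed-text")
    vectors = embedder.embed_documents([c.page_content for c in chunks])

    # Store: write each chunk's text and embedding into a pgvector column.
    with conn.cursor() as cur:
        for chunk, vector in zip(chunks, vectors):
            literal = "[" + ",".join(str(x) for x in vector) + "]"
            cur.execute(
                "INSERT INTO chunks (content, embedding) VALUES (%s, %s::vector)",
                (chunk.page_content, literal),
            )
    conn.commit()
```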
Query Pipeline
- Embed Query: User questions are converted to vector embeddings
- Search: The most similar document chunks are retrieved using cosine similarity
- Context Assembly: Retrieved chunks provide context for the LLM
- Generate: Llama 3.2, served via Ollama, generates an answer based on the document context
- Cache: Responses are cached in Redis for improved performance
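Put together, the query path might look roughly like this, reusing the hypothetical chunks table from the ingestion sketch; the prompt wording, cache key format, and one-hour TTL are illustrative choices.

```python
# Illustrative query sketch; schema, prompt, and cache policy are assumptions.
import hashlib

import psycopg
import redis
from langchain_ollama import ChatOllama, OllamaEmbeddings

embedder = OllamaEmbeddings(model="nomic-embed-text")
llm = ChatOllama(model="llama3.2")
cache = redis.Redis(host="localhost", port=6379)


def answer(question: str, conn: psycopg.Connection, top_k: int = 4) -> str:
    # Cache: return a previously generated answer for the same question, if any.
    key = "answer:" + hashlib.sha256(question.encode()).hexdigest()
    cached = cache.get(key)
    if cached:
        return cached.decode()

    # Embed Query: convert the question to a vector.
    qvec = embedder.embed_query(question)
    literal = "[" + ",".join(str(x) for x in qvec) + "]"

    # Search: <=> is pgvector's cosine distance operator; smallest distance first.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (literal, top_k),
        )
        context = "\n\n".join(row[0] for row in cur.fetchall())

    # Context Assembly + Generate: ground the answer in the retrieved chunks.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    result = llm.invoke(prompt).content

    # Cache: store the answer for repeated queries (one-hour TTL here).
    cache.set(key, result, ex=3600)
    return result
```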
Key Features
Asynchronous Processing
Document processing happens in background workers via RabbitMQ, allowing users to upload large PDFs without blocking the API. Status endpoints provide real-time updates on processing progress.
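A minimal worker along these lines would consume ingestion jobs from RabbitMQ using pika; the ingest queue name, the rabbitmq hostname, and the process_document helper are hypothetical.

```python
# Sketch of a background ingestion worker; queue name, hostname, and helper are assumptions.
import json

import pika


def process_document(document_id: str) -> None:
    """Placeholder for the extract/embed/store pipeline described earlier."""


def on_message(channel, method, properties, body):
    job = json.loads(body)
    process_document(job["document_id"])
    # Acknowledge only after processing succeeds so failed jobs are redelivered.
    channel.basic_ack(delivery_tag=method.delivery_tag)


connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="ingest", durable=True)
channel.basic_qos(prefetch_count=1)  # process one document at a time per worker
channel.basic_consume(queue="ingest", on_message_callback=on_message)
channel.start_consuming()
```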
Vector Similarity Search
Uses the pgvector PostgreSQL extension for efficient similarity search across document embeddings, enabling fast retrieval of relevant content from large document collections.
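A schema and index along the following lines supports that search; the table name, connection string, and IVFFlat parameters are illustrative, and the 768-dimension column matches the output size of nomic-embed-text.

```python
# Illustrative pgvector schema and index setup; names and parameters are assumptions.
import psycopg

STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS chunks (
        id        bigserial PRIMARY KEY,
        content   text NOT NULL,
        embedding vector(768)  -- nomic-embed-text produces 768-dimensional vectors
    )
    """,
    # Approximate nearest-neighbour index on cosine distance keeps retrieval fast as the
    # collection grows; the lists setting trades recall against speed.
    """
    CREATE INDEX IF NOT EXISTS chunks_embedding_idx
        ON chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)
    """,
]

with psycopg.connect("postgresql://bookrag:bookrag@localhost/bookrag") as conn:
    for statement in STATEMENTS:
        conn.execute(statement)
```

IVFFlat is used here for simplicity; pgvector's HNSW index is an alternative with different build-time and query-time trade-offs.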
Privacy-Focused Design
All processing happens locally using Ollama models. No data leaves the local environment, making it suitable for sensitive documents.
Technical Challenges
Addressed several challenges in building a production-ready RAG system:
- Chunking Strategy: Implemented recursive text splitting with overlap to maintain context across chunk boundaries while keeping chunks within embedding model limits
- Vector Search Optimization: Configured pgvector indexes and tuned similarity thresholds to balance retrieval accuracy with performance
- Model Performance: Worked around cold start times for Ollama models by implementing health checks and warm-up procedures (see the sketch after this list)
- System Coordination: Used Docker Compose with proper service dependencies and health checks to ensure reliable startup across multiple containers
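The warm-up procedure could look roughly like this, using Ollama's REST API; the ollama hostname mirrors a Docker Compose service name, and the timeouts are arbitrary example values.

```python
# Hypothetical readiness check and warm-up for Ollama; hostname and timeouts are assumptions.
import time

import requests

OLLAMA_URL = "http://ollama:11434"


def wait_until_ready(timeout: float = 120.0) -> None:
    """Poll Ollama until its API responds, mirroring a container health check."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{OLLAMA_URL}/api/tags", timeout=2).ok:
                return
        except requests.ConnectionError:
            pass
        time.sleep(2)
    raise RuntimeError("Ollama did not become ready in time")


def warm_up() -> None:
    """Send one trivial request to each model so it is loaded before real traffic arrives."""
    requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": "warm-up"},
        timeout=120,
    )
    requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": "llama3.2", "prompt": "Reply with OK.", "stream": False},
        timeout=300,
    )


if __name__ == "__main__":
    wait_until_ready()
    warm_up()
```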
Takeaways
This project provided hands-on experience with the complete RAG architecture, from document processing to response generation. I learned about vector databases and similarity search algorithms, how to design asynchronous processing systems, and the practical considerations of deploying LLM-based applications. The focus on local deployment highlighted the importance of system design choices when working with resource-intensive models.