PDF Question Answering System
Brief
BookRAG is a production-ready Retrieval-Augmented Generation system for querying PDF documents using local AI models. The system runs entirely on-premises, ensuring complete data privacy without requiring any cloud services or API keys.
RAG systems combine a large language model with search over a specific document collection, enabling question answering grounded in the documents themselves. This is more reliable than querying an LLM on its own, because responses are based on the actual document text rather than the model's memory of its training data.
Implementation
Built a complete RAG pipeline with production-grade infrastructure:
- Designed a FastAPI backend with endpoints for document upload, processing status, and natural language querying (see the sketch after this list)
- Implemented document processing using LangChain to split PDFs into chunks and generate embeddings via Ollama
- Configured PostgreSQL with pgvector extension for efficient vector similarity search using cosine similarity
- Set up asynchronous processing with RabbitMQ workers to handle document ingestion in the background
- Added Redis caching layer to improve response times for repeated queries
- Containerized the entire stack using Docker Compose for easy deployment
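To make the API shape concrete, here is a minimal FastAPI sketch of the three endpoint types described above; the paths, the in-memory status map, and the stub functions are illustrative placeholders rather than the project's actual code.

```python
# Hypothetical sketch of the BookRAG API surface; paths and helpers are illustrative.
from uuid import uuid4

from fastapi import FastAPI, UploadFile
from pydantic import BaseModel

app = FastAPI(title="BookRAG")

# In-memory status map stands in for a real persistence layer.
STATUS: dict[str, str] = {}


class QueryRequest(BaseModel):
    document_id: str
    question: str


def enqueue_ingestion(document_id: str, pdf_bytes: bytes) -> None:
    """Placeholder for publishing an ingestion job to RabbitMQ."""
    STATUS[document_id] = "processing"


@app.post("/documents")
async def upload_document(file: UploadFile):
    """Accept a PDF upload and queue it for background processing."""
    document_id = str(uuid4())
    STATUS[document_id] = "queued"
    enqueue_ingestion(document_id, await file.read())
    return {"document_id": document_id, "status": STATUS[document_id]}


@app.get("/documents/{document_id}/status")
async def processing_status(document_id: str):
    """Report processing status for a previously uploaded document."""
    return {"document_id": document_id, "status": STATUS.get(document_id, "unknown")}


@app.post("/query")
async def query_documents(request: QueryRequest):
    """Stub endpoint; the real handler runs the retrieval and generation pipeline."""
    return {"document_id": request.document_id, "answer": None}
```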
Technical Details
The system follows a microservices architecture with clear separation of concerns:
Document Processing Pipeline
- Upload: PDFs are uploaded via an API endpoint and queued for processing
- Extract: Background workers extract text and split it into manageable chunks
- Embed: Each chunk is converted to a vector embedding using the nomic-embed-text model served by Ollama
- Store: Embeddings are stored in PostgreSQL with pgvector for similarity search
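The extract, embed, and store steps can be condensed into a short sketch, assuming the langchain-community, langchain-ollama, and psycopg packages; the chunks table, its columns, and the chunk sizes are hypothetical choices rather than the project's exact configuration.

```python
# Illustrative ingestion sketch; the chunks table and chunk sizes are assumptions.
import psycopg
from langchain_community.document_loaders import PyPDFLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


def ingest_pdf(path: str, conn: psycopg.Connection) -> None:
    # Extract: load page text from the PDF.
    pages = PyPDFLoader(path).load()

    # Split: recursive splitting with overlap preserves context across chunk boundaries.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(pages)

    # Embed: one vector per chunk via the nomic-embed-text model served by Ollama.
    embedder = OllamaEmbeddings(model="nomic-embed-text")
    vectors = embedder.embed_documents([c.page_content for c in chunks])

    # Store: write each chunk's text and embedding into a pgvector column.
    with conn.cursor() as cur:
        for chunk, vector in zip(chunks, vectors):
            literal = "[" + ",".join(str(x) for x in vector) + "]"
            cur.execute(
                "INSERT INTO chunks (content, embedding) VALUES (%s, %s::vector)",
                (chunk.page_content, literal),
            )
    conn.commit()
```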
Query Pipeline
- Embed Query: User questions are converted to vector embeddings
- Search: The most similar document chunks are retrieved using cosine similarity
- Context Assembly: Retrieved chunks provide context for the LLM
- Generate: Llama 3.2, served via Ollama, generates an answer based on the document context
- Cache: Responses are cached in Redis for improved performance
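Put together, the query path might look roughly like this, reusing the hypothetical chunks table from the ingestion sketch; the prompt wording, cache key format, and one-hour TTL are illustrative choices.

```python
# Illustrative query sketch; schema, prompt, and cache policy are assumptions.
import hashlib

import psycopg
import redis
from langchain_ollama import ChatOllama, OllamaEmbeddings

embedder = OllamaEmbeddings(model="nomic-embed-text")
llm = ChatOllama(model="llama3.2")
cache = redis.Redis(host="localhost", port=6379)


def answer(question: str, conn: psycopg.Connection, top_k: int = 4) -> str:
    # Cache: return a previously generated answer for the same question, if any.
    key = "answer:" + hashlib.sha256(question.encode()).hexdigest()
    cached = cache.get(key)
    if cached:
        return cached.decode()

    # Embed Query: convert the question to a vector.
    qvec = embedder.embed_query(question)
    literal = "[" + ",".join(str(x) for x in qvec) + "]"

    # Search: <=> is pgvector's cosine distance operator; smallest distance first.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (literal, top_k),
        )
        context = "\n\n".join(row[0] for row in cur.fetchall())

    # Context Assembly + Generate: ground the answer in the retrieved chunks.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    result = llm.invoke(prompt).content

    # Cache: store the answer for repeated queries (one-hour TTL here).
    cache.set(key, result, ex=3600)
    return result
```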
Key Features
Asynchronous Processing
Document processing happens in background workers via RabbitMQ, allowing users to upload large PDFs without blocking the API. Status endpoints provide real-time updates on processing progress.
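A minimal worker along these lines would consume ingestion jobs from RabbitMQ using pika; the ingest queue name, the rabbitmq hostname, and the process_document helper are hypothetical.

```python
# Sketch of a background ingestion worker; queue name, hostname, and helper are assumptions.
import json

import pika


def process_document(document_id: str) -> None:
    """Placeholder for the extract/embed/store pipeline described earlier."""


def on_message(channel, method, properties, body):
    job = json.loads(body)
    process_document(job["document_id"])
    # Acknowledge only after processing succeeds so failed jobs are redelivered.
    channel.basic_ack(delivery_tag=method.delivery_tag)


connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="ingest", durable=True)
channel.basic_qos(prefetch_count=1)  # process one document at a time per worker
channel.basic_consume(queue="ingest", on_message_callback=on_message)
channel.start_consuming()
```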
Vector Similarity Search
Uses the pgvector PostgreSQL extension for efficient similarity search across document embeddings, enabling fast retrieval of relevant content from large document collections.
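A schema and index along the following lines supports that search; the table name, connection string, and IVFFlat parameters are illustrative, and the 768-dimension column matches the output size of nomic-embed-text.

```python
# Illustrative pgvector schema and index setup; names and parameters are assumptions.
import psycopg

STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS chunks (
        id        bigserial PRIMARY KEY,
        content   text NOT NULL,
        embedding vector(768)  -- nomic-embed-text produces 768-dimensional vectors
    )
    """,
    # Approximate nearest-neighbour index on cosine distance keeps retrieval fast as the
    # collection grows; the lists setting trades recall against speed.
    """
    CREATE INDEX IF NOT EXISTS chunks_embedding_idx
        ON chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)
    """,
]

with psycopg.connect("postgresql://bookrag:bookrag@localhost/bookrag") as conn:
    for statement in STATEMENTS:
        conn.execute(statement)
```

IVFFlat is used here for simplicity; pgvector's HNSW index is an alternative with different build-time and query-time trade-offs.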
Privacy-Focused Design
All processing happens locally using Ollama models. No data leaves the local environment, making it suitable for sensitive documents.
Technical Challenges
Addressed several challenges in building a production-ready RAG system:
- Chunking Strategy: Implemented recursive text splitting with overlap to maintain context across chunk boundaries while keeping chunks within embedding model limits
- Vector Search Optimization: Configured pgvector indexes and tuned similarity thresholds to balance retrieval accuracy with performance
- Model Performance: Worked around cold start times for Ollama models by implementing health checks and warm-up procedures (see the sketch after this list)
- System Coordination: Used Docker Compose with proper service dependencies and health checks to ensure reliable startup across multiple containers
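The warm-up procedure could look roughly like this, using Ollama's REST API; the ollama hostname mirrors a Docker Compose service name, and the timeouts are arbitrary example values.

```python
# Hypothetical readiness check and warm-up for Ollama; hostname and timeouts are assumptions.
import time

import requests

OLLAMA_URL = "http://ollama:11434"


def wait_until_ready(timeout: float = 120.0) -> None:
    """Poll Ollama until its API responds, mirroring a container health check."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{OLLAMA_URL}/api/tags", timeout=2).ok:
                return
        except requests.ConnectionError:
            pass
        time.sleep(2)
    raise RuntimeError("Ollama did not become ready in time")


def warm_up() -> None:
    """Send one trivial request to each model so it is loaded before real traffic arrives."""
    requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": "warm-up"},
        timeout=120,
    )
    requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": "llama3.2", "prompt": "Reply with OK.", "stream": False},
        timeout=300,
    )


if __name__ == "__main__":
    wait_until_ready()
    warm_up()
```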
Takeaways
This project provided hands-on experience with the complete RAG architecture, from document processing to response generation. I learned about vector databases and similarity search algorithms, how to design asynchronous processing systems, and the practical considerations of deploying LLM-based applications. The focus on local deployment highlighted the importance of system design choices when working with resource-intensive models.