
PDF Question Answering System

ROLE

Personal Project

TECHNOLOGIES

Python, FastAPI, PostgreSQL, pgvector, Ollama, Docker, Redis, RabbitMQ

Brief

BookRAG is a production-ready Retrieval-Augmented Generation system for querying PDF documents using local AI models. The system runs entirely on-premises, ensuring complete data privacy without requiring any cloud services or API keys.

RAG systems pair a large language model with search over a specific set of documents, enabling accurate question answering based on document content. This is more reliable than prompting an LLM directly, because responses are grounded in the actual document text rather than the model's memorized knowledge.

Implementation

Built a complete RAG pipeline with production-grade infrastructure:

  • Designed a FastAPI backend with endpoints for document upload, processing status, and natural language querying (an upload-endpoint sketch follows this list)
  • Implemented document processing using LangChain to split PDFs into chunks and generate embeddings via Ollama
  • Configured PostgreSQL with pgvector extension for efficient vector similarity search using cosine similarity
  • Set up asynchronous processing with RabbitMQ workers to handle document ingestion in the background
  • Added Redis caching layer to improve response times for repeated queries
  • Containerized the entire stack using Docker Compose for easy deployment
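
For illustration, a minimal sketch of how the upload endpoint and the queueing step might fit together. The /documents route, the ingest queue name, the upload directory, and the hostnames are assumptions, not details taken from the project:

```python
import json
import uuid
from pathlib import Path

import pika
from fastapi import FastAPI, UploadFile

app = FastAPI()
UPLOAD_DIR = Path("/data/uploads")  # illustrative storage location


def enqueue_ingest(document_id: str, path: str) -> None:
    """Publish a processing job to RabbitMQ so a background worker can pick it up."""
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = connection.channel()
    channel.queue_declare(queue="ingest", durable=True)
    channel.basic_publish(
        exchange="",
        routing_key="ingest",
        body=json.dumps({"document_id": document_id, "path": path}),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )
    connection.close()


@app.post("/documents")
async def upload_document(file: UploadFile):
    """Accept a PDF, store it on disk, and queue it for background processing."""
    document_id = str(uuid.uuid4())
    path = UPLOAD_DIR / f"{document_id}.pdf"
    path.write_bytes(await file.read())
    enqueue_ingest(document_id, str(path))
    return {"document_id": document_id, "status": "queued"}
```

The endpoint returns immediately with a document ID; a worker on the other side of the queue runs the processing pipeline described in the next section.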

Technical Details

The system follows a microservices architecture with clear separation of concerns:

Document Processing Pipeline

  1. Upload: PDFs are uploaded via API endpoint and queued for processing
  2. Extract: Background workers extract text and split it into manageable chunks
  3. Embed: Each chunk is converted to vector embeddings using the nomic-embed-text model served by Ollama (see the worker sketch after this list)
  4. Store: Embeddings are stored in PostgreSQL with pgvector for similarity search
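
A minimal sketch of what the worker side of this pipeline might look like, assuming pypdf for extraction, LangChain's recursive splitter, the ollama Python client, and a chunks table with a pgvector embedding column. Chunk sizes, table names, and connection strings are illustrative:

```python
import numpy as np
import ollama
import psycopg
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pgvector.psycopg import register_vector
from pypdf import PdfReader

# Overlapping chunks preserve context across boundaries; the exact sizes are a guess.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)


def process_document(document_id: str, path: str) -> None:
    """Extract text, split it into chunks, embed each chunk, and store the vectors."""
    text = "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    chunks = splitter.split_text(text)

    with psycopg.connect("postgresql://rag:rag@postgres/rag") as conn:
        register_vector(conn)  # lets psycopg send numpy arrays to a vector column
        for chunk in chunks:
            embedding = ollama.embeddings(model="nomic-embed-text", prompt=chunk)["embedding"]
            conn.execute(
                "INSERT INTO chunks (document_id, content, embedding) VALUES (%s, %s, %s)",
                (document_id, chunk, np.array(embedding)),
            )
        conn.execute("UPDATE documents SET status = 'ready' WHERE id = %s", (document_id,))
```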

Query Pipeline

  1. Embed Query: User questions are converted to vector embeddings
  2. Search: The most similar document chunks are retrieved using cosine similarity (see the query sketch after this list)
  3. Context Assembly: Retrieved chunks provide context for the LLM
  4. Generate: Llama 3.2, served through Ollama, generates answers based on the retrieved context
  5. Cache: Responses are cached in Redis for improved performance
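
A condensed sketch of the query path under the same illustrative schema, using pgvector's <=> cosine-distance operator. The prompt wording, cache key scheme, and TTL are assumptions:

```python
import hashlib

import numpy as np
import ollama
import psycopg
import redis
from pgvector.psycopg import register_vector

cache = redis.Redis(host="redis")


def answer_question(question: str, top_k: int = 4) -> str:
    """Embed the question, retrieve the nearest chunks, and generate a grounded answer."""
    cache_key = "answer:" + hashlib.sha256(question.encode()).hexdigest()
    if (cached := cache.get(cache_key)) is not None:
        return cached.decode()

    query_vec = np.array(
        ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"]
    )

    with psycopg.connect("postgresql://rag:rag@postgres/rag") as conn:
        register_vector(conn)
        # <=> is pgvector's cosine distance operator; smaller distance = more similar.
        rows = conn.execute(
            "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT %s",
            (query_vec, top_k),
        ).fetchall()

    context = "\n\n".join(content for (content,) in rows)
    response = ollama.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    answer = response["message"]["content"]
    cache.set(cache_key, answer, ex=3600)  # cache repeated questions for an hour
    return answer
```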

Key Features

Asynchronous Processing

Document processing happens in background workers via RabbitMQ, allowing users to upload large PDFs without blocking the API. Status endpoints provide real-time updates on processing progress.
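
One way the consumer loop and a status endpoint could look, reusing the illustrative ingest queue and documents table from the sketches above (process_document is the worker function sketched earlier); the project's actual status tracking may differ:

```python
import json

import pika
import psycopg
from fastapi import FastAPI

app = FastAPI()


@app.get("/documents/{document_id}/status")
def document_status(document_id: str):
    """Let clients poll for progress after an upload instead of waiting on the request."""
    with psycopg.connect("postgresql://rag:rag@postgres/rag") as conn:
        row = conn.execute(
            "SELECT status FROM documents WHERE id = %s", (document_id,)
        ).fetchone()
    return {"document_id": document_id, "status": row[0] if row else "unknown"}


def run_worker() -> None:
    """Consume ingest jobs one at a time and ack only after successful processing."""
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = connection.channel()
    channel.queue_declare(queue="ingest", durable=True)
    channel.basic_qos(prefetch_count=1)  # one large PDF per worker at a time

    def handle(ch, method, properties, body):
        job = json.loads(body)
        process_document(job["document_id"], job["path"])  # from the processing sketch above
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="ingest", on_message_callback=handle)
    channel.start_consuming()
```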

Vector Similarity Search

Uses PostgreSQL's pgvector extension for efficient similarity search across document embeddings, enabling fast retrieval of relevant content from large document collections.
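
The table layout and index below are an educated guess at such a setup. The project does not state which index type it uses, so an HNSW index with vector_cosine_ops is shown as one common choice; 768 matches the nomic-embed-text embedding size:

```python
import psycopg

SCHEMA = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS documents (
    id     UUID PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'processing'
);

CREATE TABLE IF NOT EXISTS chunks (
    id          BIGSERIAL PRIMARY KEY,
    document_id UUID REFERENCES documents(id),
    content     TEXT NOT NULL,
    embedding   VECTOR(768)  -- 768 dimensions matches nomic-embed-text
);

-- An approximate-nearest-neighbour index keeps retrieval fast as the collection grows;
-- vector_cosine_ops matches the <=> operator used at query time.
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""


def init_schema() -> None:
    with psycopg.connect("postgresql://rag:rag@postgres/rag") as conn:
        conn.execute(SCHEMA)
```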

Privacy-Focused Design

All processing happens locally using Ollama models. No data leaves the local environment, making it suitable for sensitive documents.

Technical Challenges

Addressed several challenges in building a production-ready RAG system:

  • Chunking Strategy: Implemented recursive text splitting with overlap to maintain context across chunk boundaries while keeping chunks within embedding model limits
  • Vector Search Optimization: Configured pgvector indexes and tuned similarity thresholds to balance retrieval accuracy with performance
  • Model Performance: Worked around cold start times for Ollama models by implementing health checks and warm-up procedures (sketched after this list)
  • System Coordination: Used Docker Compose with proper service dependencies and health checks to ensure reliable startup across multiple containers
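
A sketch of what the health-check and warm-up step might look like, assuming Ollama is reachable at an ollama hostname inside the Compose network and that warm-up simply issues tiny requests to load both models; the project's actual procedure may differ:

```python
import time

import httpx
import ollama

OLLAMA_URL = "http://ollama:11434"  # service name inside the Compose network (assumed)
client = ollama.Client(host=OLLAMA_URL)


def wait_for_ollama(timeout: float = 120.0) -> None:
    """Poll the Ollama HTTP API until the server responds, as a startup health check."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if httpx.get(f"{OLLAMA_URL}/api/tags", timeout=2.0).status_code == 200:
                return
        except httpx.HTTPError:
            pass
        time.sleep(2.0)
    raise RuntimeError("Ollama did not become ready in time")


def warm_up_models() -> None:
    """Issue tiny requests so both models are loaded before the first real query."""
    client.embeddings(model="nomic-embed-text", prompt="warm-up")
    client.chat(model="llama3.2", messages=[{"role": "user", "content": "ping"}])


if __name__ == "__main__":
    wait_for_ollama()
    warm_up_models()
```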

Takeaways

This project provided hands-on experience with the complete RAG architecture, from document processing to response generation. I learned about vector databases and similarity search algorithms, how to design asynchronous processing systems, and the practical considerations of deploying LLM-based applications. The focus on local deployment highlighted the importance of system design choices when working with resource-intensive models.