DOXA: RAG Pipeline for Document-Based QA Labeling & Extraction
Designed and implemented a retrieval-augmented generation (RAG) system that extracts question-answer pairs from unstructured PDF documents through an automated pipeline. Applied document parsing, segmentation, and knowledge extraction to build a structured QA knowledge base. Developed context-preserving text chunking and vector search for efficient information retrieval.
• Automated ingestion and parsing of multi-page agricultural and technical PDFs.
• Labeled and organized extracted text into chunked Q&A pairs tailored for conversational AI queries.
• Adopted FAISS for scalable vector-based document retrieval.
• Developed and tuned modular components for robust QA chain management.
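
As an illustration of the context-preserving chunking mentioned above, one common approach is a sliding window with overlap, so that each chunk repeats the tail of the previous one. The sketch below is a minimal example under that assumption and is not taken from the project itself; the function name `chunk_text`, the word-based splitting, and the default sizes are all hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words.

    Hypothetical helper for illustration only: the sizes and the
    whitespace-based word splitting are assumptions.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(12))
    for chunk in chunk_text(sample, chunk_size=5, overlap=2):
        print(chunk)
```

Each resulting chunk would then typically be embedded and added to a vector index such as FAISS; the overlap keeps a sentence that straddles a boundary retrievable from either side.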