Document Chunking & Summarization — Domain-Specific RAG Pipeline
I developed a retrieval-augmented generation (RAG) pipeline to process and label academic PDFs by segmenting documents into overlapping, information-rich chunks. This included extracting text, designing labeling rules for keyword coverage, and benchmarking summarization performance. Embedding-based search and chunk selection formed the core of the data labeling strategy.
• Processed and annotated 594 pages from six academic PDFs, generating 695 labeled document chunks.
• Defined chunk-level labeling criteria to ensure 93.3% keyword coverage against benchmarks.
• Evaluated and compared summarization quality based on ROUGE-L and cosine similarity metrics.
• Created an interactive local web interface for end-to-end labeled Q&A workflows.
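The chunking, keyword-coverage, and ROUGE-L steps described above can be illustrated with a minimal sketch. This is not the project's actual code: the function names, the word-based windows, the default `size` and `overlap` values, the case-insensitive substring matching, and the F1 form of ROUGE-L are all assumptions chosen to keep the example self-contained.

```python
def chunk_words(text, size=200, overlap=50):
    """Split text into windows of `size` words; consecutive windows share `overlap` words.

    Hypothetical helper: the window and overlap sizes are illustrative defaults.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def keyword_coverage(chunks, keywords):
    """Fraction of keywords found in at least one chunk (case-insensitive substring match)."""
    if not keywords:
        return 0.0
    lowered = [chunk.lower() for chunk in chunks]
    hits = sum(any(kw.lower() in chunk for chunk in lowered) for kw in keywords)
    return hits / len(keywords)


def rouge_l_f1(reference, candidate):
    """ROUGE-L as an F1 score over the longest common subsequence of whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    # Row-by-row dynamic programme for the LCS length.
    prev = [0] * (len(cand) + 1)
    for ref_tok in ref:
        curr = [0]
        for j, cand_tok in enumerate(cand, start=1):
            if ref_tok == cand_tok:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    lcs = prev[-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

In a workflow like the one described, `keyword_coverage` would be run over all chunks against a benchmark keyword list, and `rouge_l_f1` over each generated summary against its reference summary.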