Data Science Fellow—RAG Dataset Preparation & EDA
Performed exploratory data analysis (EDA) and data cleaning on arXiv metadata to prepare a high-quality corpus for a retrieval-augmented generation (RAG) pipeline. Engineered NLP features and documented the data schema to support scalable chunking and embedding for LLM-based retrieval. Built ETL and feature engineering workflows to improve the quality of the data supplied to downstream AI models.
• Prepared and curated text data for LLM-focused RAG datasets
• Implemented NLP feature engineering for downstream AI tasks
• Created data schemas to keep the corpus scalable and easy to ingest
• Conducted rigorous quality checks to validate annotation and preparation steps