Data Engineer for Instruct Dataset Curation (High-Quality Hindi & Hinglish Instruct Datasets Project)
Curated and engineered high-quality instructional datasets in Hindi and Hinglish for language model training. The work consisted of collecting, organizing, and vetting text data to ensure suitability for LLM fine-tuning and benchmarking. Contributed directly to the improvement of multilingual and low-resource language AI models.
• Assembled a 30,000-entry Hindi instruct dataset curated for LLM tasks.
• Developed HINGLISH-LIMA to address code-mixed language use cases.
• Ensured instruction-following data was properly formatted and diversified.
• Collaborated with engineering to validate data prior to integration.