Custom LLM (GPT-style) Training from Raw Data
Independently trained a GPT-style language model on raw Indian-language text corpora. Owned tokenization, data preprocessing, and transformer architecture tuning end to end, with particular attention to data quality and consistency when preparing corpora for large language model training.
• Handled the full pipeline from raw data collection to training-ready inputs.
• Built custom preprocessing scripts for language-data normalization and cleaning (see the sketch below).
• Produced high-quality annotations for supervised fine-tuning workflows.
• Ran iterative quality checks and validation of generated labels.
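
A minimal sketch of the kind of normalization-and-cleaning pass described in the bullets above, assuming a line-oriented UTF-8 corpus. The function names (`clean_line`, `clean_file`), the specific rules (NFC normalization, control-character stripping, whitespace collapse), and the `min_chars` threshold are illustrative assumptions, not the original scripts; NFC is a common choice for Indic scripts because combining vowel signs can otherwise have multiple byte encodings.

```python
import re
import unicodedata

def clean_line(text: str) -> str:
    """Normalize one raw corpus line for LLM pretraining."""
    # NFC gives combining marks in Indic scripts a single canonical
    # byte sequence, so identical words tokenize identically.
    text = unicodedata.normalize("NFC", text)
    # Drop control characters (Unicode category "C*") except tab/newline.
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in "\t\n"
    )
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

def clean_file(src: str, dst: str, min_chars: int = 20) -> None:
    """Stream a raw file through clean_line, keeping non-trivial lines."""
    with open(src, encoding="utf-8", errors="replace") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            cleaned = clean_line(line)
            if len(cleaned) >= min_chars:  # drop fragments and empty lines
                fout.write(cleaned + "\n")
```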
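
The tokenization step mentioned above would typically mean training a subword tokenizer on the cleaned corpus. The snippet below is a sketch using the Hugging Face `tokenizers` library; the BPE model choice, 32k vocabulary size, special-token set, and the file name `corpus_clean.txt` are assumptions for illustration, not details from the original project.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# A plain BPE model with whitespace pre-tokenization; the [UNK] token
# covers characters never seen during training.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,  # assumed size, not stated in the source
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
tokenizer.train(files=["corpus_clean.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")  # reload later with Tokenizer.from_file
```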