Data Labeling for Large Language Model (LLM) Training
I worked on a data labeling and curation project for training a large language model (LLM). My tasks included:
- Annotating and classifying named entities for named entity recognition (NER) in large English and Spanish text datasets.
- Evaluating and refining AI-generated responses to improve model accuracy and coherence.
- Generating and curating question-answer datasets to enhance the model’s conversational capabilities.
- Implementing rigorous quality control measures, including double review and consistency checks.
In total, the project involved annotating over 100,000 text samples, with the aim of producing high-quality data representative of diverse linguistic contexts.
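
To illustrate what a consistency check on double-reviewed labels can look like, here is a minimal Python sketch. It is an illustration only, not the project's actual tooling: the function names, the entity labels, and the choice of Cohen's kappa as the agreement measure are all assumptions made for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two reviewers, corrected for chance.

    Hypothetical helper for illustration; both lists must label the same
    items in the same order.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    # Fraction of items on which the two reviewers chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each reviewer's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both reviewers used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def disagreements(labels_a, labels_b):
    """Indices of items where the reviewers differ, to route to a third review."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


# Made-up example labels (assumed tag set: PER, ORG, LOC, O).
reviewer_a = ["PER", "ORG", "LOC", "PER", "O", "ORG"]
reviewer_b = ["PER", "ORG", "LOC", "ORG", "O", "ORG"]

kappa = cohen_kappa(reviewer_a, reviewer_b)
to_recheck = disagreements(reviewer_a, reviewer_b)
print(f"kappa = {kappa:.3f}, items to re-check: {to_recheck}")
```

A kappa near 1 indicates strong agreement between the two reviewers, while values near 0 indicate agreement no better than chance. The list of disagreeing indices shows which samples would need another look.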