LLM Text Evaluation and Prompt Engineering Project
As part of an ongoing LLM evaluation initiative with Scale AI, I designed and labeled text-based datasets to improve the accuracy, coherence, and contextual reasoning of large language models. My tasks included writing and evaluating thousands of prompt–response pairs, classifying text by tone and intent, and assessing AI-generated outputs for factual consistency, creativity, and compliance with task instructions. I also took part in the quality assurance and scoring phase, rating model outputs against detailed rubrics and collaborating with cross-functional teams to identify linguistic biases and style inconsistencies. The project ran through iterative testing cycles and required maintaining annotation accuracy above 98% against internal benchmarks.
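As a rough illustration of the rubric-based scoring and accuracy tracking described above, the Python sketch below shows how a rated prompt–response pair might be recorded against rubric dimensions and how annotation accuracy could be checked against a gold-labeled subset. The dimension names, score scale, labels, and 98% threshold here are assumptions for illustration only; they do not reflect Scale AI's internal rubrics or tooling.

```python
from dataclasses import dataclass, field

# Hypothetical rubric dimensions; the actual internal rubric is not public.
RUBRIC_DIMENSIONS = ("factual_consistency", "creativity", "instruction_compliance")


@dataclass
class RatedResponse:
    """One prompt-response pair with per-dimension rubric scores (assumed 1-5 scale)."""
    prompt: str
    response: str
    scores: dict = field(default_factory=dict)  # dimension -> score

    def overall(self) -> float:
        """Unweighted mean across rubric dimensions."""
        return sum(self.scores.values()) / len(self.scores)


def annotation_accuracy(labels: list[str], gold: list[str]) -> float:
    """Fraction of annotations that match a gold-labeled benchmark set."""
    matches = sum(1 for mine, ref in zip(labels, gold) if mine == ref)
    return matches / len(gold)


if __name__ == "__main__":
    rated = RatedResponse(
        prompt="Summarize the article in two sentences.",
        response="The article argues that ...",
        scores={"factual_consistency": 5, "creativity": 3, "instruction_compliance": 4},
    )
    print(f"overall rubric score: {rated.overall():.2f}")

    # Tone/intent labels checked against a small gold subset (illustrative data only).
    my_labels = ["informative", "persuasive", "informative", "neutral"]
    gold_labels = ["informative", "persuasive", "informative", "informative"]
    acc = annotation_accuracy(my_labels, gold_labels)
    print(f"annotation accuracy: {acc:.2%} (benchmark: >98%)")
```

In practice, the accuracy check would run over a much larger gold subset than shown here, with per-dimension agreement tracked separately rather than as a single match rate.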