Evaluator
This project was designed to test the robustness of large language models by crafting prompts that could expose failure points across dimensions such as reasoning, safety, and compliance. The task also involved evaluating and rewriting model responses to meet ideal output standards. Labeling work included categorizing prompt intent, diagnosing response flaws, and revising text according to predefined quality criteria. The project generated and reviewed thousands of prompt-response sets, with quality ensured through peer review and adherence to high-precision annotation guidelines.
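To illustrate the kind of labeling work described above, the sketch below shows one plausible way a single prompt-response annotation record could be structured and checked against guidelines. It is a minimal, hypothetical example: the field names, intent and flaw categories, and the validation rules are assumptions for illustration, not the project's actual schema or criteria.

```python
# Hypothetical sketch of a prompt-response annotation record and a simple
# guideline check. Category lists and field names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List

PROMPT_INTENTS = ["reasoning", "safety", "compliance"]            # assumed intent labels
RESPONSE_FLAWS = ["logical_error", "policy_violation", "incomplete_answer"]  # assumed flaw labels


@dataclass
class AnnotationRecord:
    prompt: str
    model_response: str
    intent: str                                        # one of PROMPT_INTENTS
    flaws: List[str] = field(default_factory=list)     # subset of RESPONSE_FLAWS
    rewritten_response: str = ""                       # revision meeting quality criteria
    peer_reviewed: bool = False


def validate(record: AnnotationRecord) -> List[str]:
    """Return a list of guideline violations found in a single record."""
    problems = []
    if record.intent not in PROMPT_INTENTS:
        problems.append(f"unknown intent: {record.intent}")
    for flaw in record.flaws:
        if flaw not in RESPONSE_FLAWS:
            problems.append(f"unknown flaw label: {flaw}")
    if record.flaws and not record.rewritten_response:
        problems.append("flawed response has no rewrite")
    return problems


if __name__ == "__main__":
    rec = AnnotationRecord(
        prompt="Explain why 0.1 + 0.2 != 0.3 in floating point.",
        model_response="Because computers are bad at math.",
        intent="reasoning",
        flaws=["incomplete_answer"],
        rewritten_response=(
            "Binary floating point cannot represent 0.1 or 0.2 exactly, "
            "so their sum differs slightly from 0.3."
        ),
    )
    print(validate(rec))  # prints [] when the record passes the checks
```

In practice, a validator like this would sit alongside peer review rather than replace it: automated checks catch schema-level mistakes, while reviewers judge whether the rewrite actually meets the quality criteria.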