LLM Prompt Evaluation & Safety Rating
Evaluated thousands of AI-generated responses for safety, helpfulness, coherence, tone, and factual accuracy against proprietary guidelines. Tasks included rating LLM outputs, flagging harmful or biased content, red-teaming model behavior, and writing prompt-response pairs for fine-tuning. The project covered multilingual evaluation (English and Spanish) and required close attention to linguistic nuance and ethical alignment. Delivered consistent, high-quality feedback with a QA approval rate above 95%, and labeled over 500 hours of training and validation data in collaboration with global AI teams.