LLM Generalist, DataAnnotation
- Evaluate and improve large language models through structured prompt engineering, response analysis, and multi-turn conversational testing.
- Design evaluation rubrics and apply detailed quality standards to assess model reasoning, factual accuracy, instruction following, and safety compliance.
- Perform end-to-end task evaluations, including reviewing annotations produced by other contributors, identifying major model failures, and documenting edge cases that affect model reliability.
- Generate targeted prompts to stress-test model behavior and surface systematic weaknesses.