Generalist, Mercor
In this role, I evaluated and rated responses generated by large language models (LLMs), annotating strengths, weaknesses, and factual accuracy while assessing tone, reasoning quality, and completeness. The work also involved fact-checking outputs against reputable sources and tools to ensure data integrity.
• Generated human evaluation data for RLHF pipelines.
• Annotated model outputs for accuracy, tone, and other quality criteria.
• Reviewed LLM responses to user queries against task guidelines.
• Contributed to improving LLM alignment and response quality.