LLM Evaluator
As an LLM Evaluator at Populii, I manually assessed model responses for accuracy, coherence, relevance, and safety across a range of language tasks. I provided preference labels for RLHF and contributed to evaluation rubrics designed to improve LLM fine-tuning and reduce hallucinations. I also documented biases and unsafe outputs in detail and relayed actionable feedback to the engineering team for model optimization.
• Evaluated model response quality for summarization and question-answering tasks.
• Used LabelStudio and Google Sheets to track safety issues and document model outputs.
• Developed and iterated on evaluation guidelines in Labelbox.
• Collaborated with ML engineers to recommend prompt improvements.