AI Model Evaluation
In this project, I evaluated AI-generated responses in English and Indonesian across a range of tasks, including chatbot conversations, AI safety classifications, and content appropriateness reviews. I assessed thousands of model outputs against defined quality measures such as accuracy, safety, relevance, tone alignment, and localization. I also performed peer evaluations, rating fellow trainers' assessments to maintain annotation consistency and data integrity. The dataset consisted of open-domain conversational prompts, classification tasks involving harmful or unsafe content, and multilingual text responses. The work demanded strong attention to detail, cultural sensitivity, and quick, unbiased judgment under tight deadlines, while adhering to project guidelines and quality benchmarks.