Multilingual LLM Evaluation and Prompt Rating Project
Participated in a high-volume LLM prompt and response evaluation project covering English and Swahili datasets. Rated AI-generated responses for helpfulness, accuracy, safety, bias, and ethical alignment. Tasks included ranking completions, rewriting prompts for clarity, refining conversational responses, and flagging unsafe outputs. Contributed to reinforcement learning from human feedback (RLHF) pipelines used to align models with human values. Handled over 20,000 text pairs with a 98% agreement rate in QA review. Ensured linguistic clarity, neutrality, and alignment with OpenAI's safety and quality standards, and delivered consistent, high-quality feedback under tight deadlines while maintaining language sensitivity across multiple dialects.