Pontius A vs B Evaluation
Worked on a large-scale A vs B response evaluation project for large language models using the Pontius framework. The project compared paired AI-generated responses to the same prompt and selected the better output against predefined quality dimensions. Key responsibilities included:

• Evaluating A/B model responses for relevance, correctness, completeness, reasoning quality, tone, and safety
• Applying RLHF-aligned judgment criteria to rank outputs and provide preference signals
• Identifying hallucinations, factual errors, bias, policy violations, and instruction-following issues
• Performing fine-grained, holistic qualitative assessment rather than isolated per-dimension scoring, so that each judgment remained non-decomposable
• Writing concise justifications explaining why one response outperformed the other

Strict quality standards were followed, including guideline adherence, consistency checks, and reviewer calibration to ensure high inter-annotator agreement (a minimal agreement check is sketched below).
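To illustrate the kind of consistency check used for reviewer calibration, the following is a minimal sketch of inter-annotator agreement on A/B preference labels. The reviewer labels and helper functions are hypothetical examples, not artifacts of the Pontius project itself; percent agreement and Cohen's kappa are shown as one common way to quantify agreement between two annotators.

from collections import Counter

# Hypothetical A/B preference labels from two reviewers on the same prompts
# ("A" = response A preferred, "B" = response B preferred).
reviewer_1 = ["A", "B", "A", "A", "B", "A", "B", "B", "A", "A"]
reviewer_2 = ["A", "B", "A", "B", "B", "A", "B", "A", "A", "A"]

def percent_agreement(labels_1, labels_2):
    """Share of items on which both reviewers picked the same response."""
    matches = sum(a == b for a, b in zip(labels_1, labels_2))
    return matches / len(labels_1)

def cohens_kappa(labels_1, labels_2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_1)
    observed = percent_agreement(labels_1, labels_2)
    counts_1 = Counter(labels_1)
    counts_2 = Counter(labels_2)
    # Expected agreement if both reviewers labelled independently
    # at their observed marginal rates.
    expected = sum(
        (counts_1[label] / n) * (counts_2[label] / n)
        for label in set(labels_1) | set(labels_2)
    )
    return (observed - expected) / (1 - expected)

if __name__ == "__main__":
    print(f"Percent agreement: {percent_agreement(reviewer_1, reviewer_2):.2f}")
    print(f"Cohen's kappa:     {cohens_kappa(reviewer_1, reviewer_2):.2f}")

In practice, agreement scores like these would be tracked across calibration rounds, and reviewers with persistently low kappa against the group would be re-calibrated against the guidelines.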