LLM Prompt Evaluation & Response Ranking
Evaluated AI-generated responses from large language models by comparing outputs against structured guidelines and ranking them on relevance, accuracy, coherence, and instruction adherence. Performed pairwise comparisons and multi-response ranking tasks to improve model performance and alignment. Identified issues such as hallucinations, factual inaccuracies, and poor instruction-following. Applied detailed evaluation rubrics to deliver consistent, high-quality feedback for model training and reinforcement learning workflows, maintaining accuracy and consistency across large volumes of evaluation tasks.
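A minimal sketch of how such rubric-based scoring, pairwise comparison, and multi-response ranking might be structured; the dimension weights, class names, and scoring scale below are illustrative assumptions rather than any specific platform's workflow:

```python
from dataclasses import dataclass, field

# Rubric dimensions matching the criteria above.
# The weights are illustrative assumptions, not values from a real rubric.
RUBRIC_WEIGHTS = {
    "relevance": 0.3,
    "accuracy": 0.3,
    "coherence": 0.2,
    "instruction_adherence": 0.2,
}

@dataclass
class ResponseEvaluation:
    """Scores (1-5) for one model response on each rubric dimension."""
    response_id: str
    scores: dict                                  # e.g. {"relevance": 4, ...}
    issues: list = field(default_factory=list)    # e.g. ["hallucination"]

    def weighted_score(self) -> float:
        # Collapse per-dimension scores into one weighted total.
        return sum(RUBRIC_WEIGHTS[dim] * self.scores[dim] for dim in RUBRIC_WEIGHTS)

def pairwise_preference(a: ResponseEvaluation, b: ResponseEvaluation) -> str:
    """Pairwise comparison: return the preferred response id, or 'tie'."""
    if a.weighted_score() > b.weighted_score():
        return a.response_id
    if b.weighted_score() > a.weighted_score():
        return b.response_id
    return "tie"

def rank_responses(evaluations: list) -> list:
    """Multi-response ranking: order candidates by weighted rubric score."""
    return sorted(evaluations, key=lambda e: e.weighted_score(), reverse=True)

if __name__ == "__main__":
    resp_a = ResponseEvaluation("A", {"relevance": 5, "accuracy": 4,
                                      "coherence": 4, "instruction_adherence": 5})
    resp_b = ResponseEvaluation("B", {"relevance": 3, "accuracy": 2,
                                      "coherence": 4, "instruction_adherence": 3},
                                issues=["hallucination", "ignored formatting instruction"])
    print(pairwise_preference(resp_a, resp_b))                          # -> A
    print([e.response_id for e in rank_responses([resp_a, resp_b])])    # -> ['A', 'B']
```

Scoring each response against the rubric first, then deriving pairwise preferences and rankings from those scores, is one way to keep judgments consistent across large volumes of tasks; in practice the preference labels and rankings are what feed the training and reinforcement learning pipelines.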