Reinforcement Learning from Human Feedback: LLM Preference Annotation
I have carried out preference-based annotation work supporting an RLHF pipeline built to align a large language model with human expectations. The core task was comparing pairs of AI-generated responses and choosing or ranking the better one against helpfulness, harmlessness, and honesty criteria. Beyond the ranking itself, I wrote detailed justifications for each preference to provide richer training signal than a binary label alone, flagged cases where neither response met quality standards, and authored reference responses for those cases so the pipeline still had a usable positive example. Through this work I developed a strong understanding of the alignment failure modes RLHF targets, including sycophantic behaviour, over-refusal, and confidently stated factual confabulation, across a wide range of prompt types. A sketch of the kind of record each comparison produces follows below.
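To make the annotation output concrete, here is a minimal sketch of one possible record shape for a single pairwise comparison. The field names, the `Preference` enum values, and the tie-handling rule in `to_reward_pair` are my own illustrative assumptions, not the schema of any particular pipeline.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Preference(Enum):
    A = "a"                # response A preferred
    B = "b"                # response B preferred
    NEITHER = "neither"    # both responses fail quality standards


@dataclass
class PreferenceRecord:
    """One pairwise comparison produced during annotation (illustrative schema)."""
    prompt: str
    response_a: str
    response_b: str
    preference: Preference
    # Free-text justification: richer training signal than the binary label alone.
    rationale: str
    # Per-criterion judgments (e.g. helpfulness/harmlessness/honesty) on the winner.
    criteria: dict[str, bool] = field(default_factory=dict)
    # Annotator-written reference response, filled in when preference is NEITHER.
    reference_response: Optional[str] = None

    def to_reward_pair(self) -> Optional[tuple[str, str]]:
        """Map the record to a (chosen, rejected) pair for reward-model training.

        When neither response passed, the annotator-written reference stands in
        as the chosen response; records with no usable pair return None.
        """
        if self.preference is Preference.A:
            return (self.response_a, self.response_b)
        if self.preference is Preference.B:
            return (self.response_b, self.response_a)
        if self.reference_response is not None:
            # Pairing the reference against response_a is an arbitrary
            # simplification for this sketch; a real pipeline might emit
            # one pair per rejected response instead.
            return (self.reference_response, self.response_a)
        return None
```

Keeping the rationale and reference response on the same record as the binary preference means downstream reward-model training can consume the richer signals without joining across separate datasets.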