LLM Code Evaluation & RLHF Annotation for Software Engineering Tasks (Project Marlin V3)
Worked as an expert contributor on Project Marlin V3, a Reinforcement Learning from Human Feedback (RLHF) initiative focused on improving large language models on software engineering tasks.

Designed complex, real-world coding prompts derived from actual GitHub pull requests across multiple programming languages (Python, JavaScript/TypeScript, Go, Rust, Java, and C++). Evaluated and compared multiple AI-generated solutions (model trajectories), analyzing code correctness, test coverage, maintainability, and overall engineering quality. The workflow involved executing models through the claude-hfi CLI tool, reviewing the generated diffs, running tests, and recording structured, evidence-based preference judgments on model performance. The role required a deep grounding in software engineering principles, debugging, refactoring, and system design.

These evaluations fed directly into model training as human preference data, supporting higher-quality code generation and better real-world performance. Tools used: claude-hfi CLI, VS Code, Git, Snorkel Expert Platform, and Python-based utilities for validation and submission.
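To make the evaluation workflow concrete, below is a minimal Python sketch of a pairwise preference record of the kind described above: rubric scores and evidence notes for two trajectories, a pre-submission validation check, and a JSON-serialized preference outcome. All names here (RUBRIC, TrajectoryScore, PreferenceRecord) are hypothetical illustrations; the project's actual schema and tooling (Snorkel Expert Platform, claude-hfi CLI) are internal and are not reproduced here.

from dataclasses import dataclass, asdict
import json

# Hypothetical rubric dimensions mirroring the criteria named above;
# the real project's rubric and field names may differ.
RUBRIC = ("correctness", "test_coverage", "maintainability", "engineering_quality")

@dataclass
class TrajectoryScore:
    """Rubric scores (1-5) and supporting evidence for one model trajectory."""
    trajectory_id: str
    scores: dict[str, int]
    evidence: list[str]  # notes citing concrete diffs, test runs, etc.

    def validate(self) -> None:
        # The kind of pre-submission check a Python validation utility might run.
        missing = set(RUBRIC) - set(self.scores)
        if missing:
            raise ValueError(f"{self.trajectory_id}: missing rubric scores {missing}")
        out_of_range = {k: v for k, v in self.scores.items() if not 1 <= v <= 5}
        if out_of_range:
            raise ValueError(f"{self.trajectory_id}: scores out of range {out_of_range}")

    def total(self) -> int:
        return sum(self.scores[k] for k in RUBRIC)

@dataclass
class PreferenceRecord:
    """A pairwise comparison between two trajectories for one coding prompt."""
    prompt_id: str
    a: TrajectoryScore
    b: TrajectoryScore

    def preferred(self) -> str:
        # Prefer the higher rubric total; break ties on correctness,
        # the dimension a code reviewer typically weights highest.
        if self.a.total() != self.b.total():
            return max((self.a, self.b), key=TrajectoryScore.total).trajectory_id
        winner = self.a if self.a.scores["correctness"] >= self.b.scores["correctness"] else self.b
        return winner.trajectory_id

    def to_json(self) -> str:
        self.a.validate()
        self.b.validate()
        record = asdict(self)
        record["preferred"] = self.preferred()
        return json.dumps(record, indent=2)

if __name__ == "__main__":
    a = TrajectoryScore("model_a", dict(zip(RUBRIC, (5, 4, 4, 4))),
                        ["fix passes the previously failing regression test; no dead code"])
    b = TrajectoryScore("model_b", dict(zip(RUBRIC, (3, 4, 3, 4))),
                        ["compiles, but two edge-case tests still fail"])
    print(PreferenceRecord("example_prompt", a, b).to_json())

Keeping the evidence notes alongside each score reflects the evidence-based feedback requirement: a preference judgment is only as useful for RLHF training as the concrete observations (test results, diff review findings) that justify it.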