RLHF Response Evaluation – STEM & Mathematics (BlackBeard / Outlier)
- Evaluated and ranked AI-generated responses for a large-scale RLHF training project (BlackBeard) on the Outlier platform, specializing in STEM domains: mathematics (calculus, linear algebra, discrete math), physics, electronics, and engineering.
- Assessed response pairs across five structured dimensions: Instruction Following, Truthfulness, Verbosity, Prompt Correctness, and Writing Style & Tone.
- Applied Likert-scale preference rankings with evidence-based written justifications, following the strict editorial and naming conventions defined in the project's style guide.
- Evaluated multimodal tasks combining text and image prompts, introduced in the project's most recent update.
- Consistently audited LaTeX syntax for accuracy, distinguishing model errors from platform rendering bugs per official guidelines.