LLM Scientific Reasoning Evaluator – Project Phoenix (Outlier AI)
In Project Phoenix, I evaluate outputs from advanced LLMs on complex biological and biomedical topics. My tasks include designing expert-level scientific prompts and authoring a 30-criterion evaluation rubric used to compare and judge responses from two different LLMs. I assess factual accuracy, reasoning quality, chain-of-thought coherence, and adherence to scientific standards. I document error patterns, identify model weaknesses, and provide high-quality annotations that improve both the safety and the accuracy of life-science-oriented AI systems. The project requires domain expertise in molecular biology, immunology, neuroscience, oncology, and experimental research methods.
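To illustrate the pairwise, rubric-driven comparison described above, here is a minimal Python sketch of weighted rubric scoring. It is purely illustrative: the names (`Criterion`, `compare`), the 1-to-5 score scale, and the example weights are assumptions, not the actual project tooling or rubric.

```python
# Hypothetical sketch of weighted rubric scoring for comparing two LLM
# responses. Names, weights, and the 1-5 scale are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str      # e.g. "factual accuracy"
    weight: float  # relative importance within the rubric


def compare(scores_a: dict[str, int],
            scores_b: dict[str, int],
            rubric: list[Criterion]) -> str:
    """Return which response wins under the weighted rubric."""
    total_a = sum(c.weight * scores_a[c.name] for c in rubric)
    total_b = sum(c.weight * scores_b[c.name] for c in rubric)
    if total_a == total_b:
        return "tie"
    return "A" if total_a > total_b else "B"


# Three of the rubric dimensions named above, with made-up weights.
rubric = [
    Criterion("factual accuracy", 2.0),
    Criterion("reasoning quality", 1.5),
    Criterion("chain-of-thought coherence", 1.0),
]

result = compare(
    {"factual accuracy": 5, "reasoning quality": 4, "chain-of-thought coherence": 4},
    {"factual accuracy": 3, "reasoning quality": 5, "chain-of-thought coherence": 5},
    rubric,
)
print(result)  # "A": the heavier weight on factual accuracy dominates
```

The worked example shows why weighting matters: response B scores higher on reasoning and coherence, yet response A wins (20.0 vs. 18.5) because factual accuracy carries the largest weight.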