Domain-Specific Benchmark Dataset for Scientific AI Reasoning
Led the design and creation of a high-quality, domain-specific dataset for benchmarking the reasoning, accuracy, and reliability of generative AI models in scientific domains (e.g., computational biology, physics).

Scope & Tasks: The project involved defining annotation schemas for complex scientific reasoning, curating source materials from academic papers, and overseeing a rigorous data-labeling pipeline. Specific tasks included:

- Crafting and labeling complex, multi-step questions requiring logical and mathematical reasoning (a schema sketch follows this list).
- Annotating chain-of-thought reasoning paths with step-by-step validity checks.
- Implementing a multi-stage review process in which PhD-level domain experts validated the labels and annotations produced by junior researchers and annotators (see the stage-transition sketch below).
- Designing and applying a detailed rubric for human-in-the-loop evaluation of model outputs, scoring each response for factual accuracy, logical soundness, and potential hallucinations (see the rubric sketch below).
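To make the annotation schema concrete, here is a minimal Python sketch of what a single benchmark item might look like. The record layout and every name in it (`BenchmarkItem`, `ReasoningStep`, the field names) are illustrative assumptions; the project's actual schema and storage format are not specified above.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ReasoningStep:
    """One step in an annotated chain-of-thought path."""
    text: str       # the reasoning step as written by the annotator
    is_valid: bool  # outcome of the step-by-step validity check

@dataclass
class BenchmarkItem:
    """A single multi-step reasoning question with its gold annotation."""
    item_id: str
    domain: str       # e.g. "computational biology" or "physics"
    source: str       # citation for the academic paper the item is curated from
    question: str
    reasoning_path: list[ReasoningStep] = field(default_factory=list)
    final_answer: str = ""

# Hypothetical item; the question text is invented for illustration only.
item = BenchmarkItem(
    item_id="bio-0001",
    domain="computational biology",
    source="<paper citation>",
    question="Which pathway dominates at steady state, and why?",
    reasoning_path=[
        ReasoningStep("Write the steady-state equation for each pathway.", True),
        ReasoningStep("Compare the effective rate constants.", True),
    ],
    final_answer="Pathway A",
)
print(json.dumps(asdict(item), indent=2))  # the schema serializes cleanly to JSON
```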
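The multi-stage review process can be modeled as a small state machine in which no item is accepted without passing PhD-level expert validation. The stage names and transition rules below are hypothetical; they sketch the constraint, not the project's actual tooling.

```python
from enum import Enum

class ReviewStage(str, Enum):
    DRAFTED = "drafted"              # labeled by a junior researcher or annotator
    EXPERT_REVIEW = "expert_review"  # under PhD-level domain-expert validation
    ACCEPTED = "accepted"
    REJECTED = "rejected"

# Allowed transitions: acceptance is only reachable through expert review.
ALLOWED = {
    ReviewStage.DRAFTED: {ReviewStage.EXPERT_REVIEW},
    ReviewStage.EXPERT_REVIEW: {ReviewStage.ACCEPTED, ReviewStage.REJECTED},
    ReviewStage.REJECTED: {ReviewStage.DRAFTED},  # rejected items go back for rework
    ReviewStage.ACCEPTED: set(),
}

def advance(current: ReviewStage, target: ReviewStage) -> ReviewStage:
    """Move an item to the next stage, refusing any skip past expert review."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Encoding the transitions explicitly makes the key property of the pipeline checkable in code: an item cannot reach ACCEPTED without first passing through EXPERT_REVIEW.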
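Similarly, the human-in-the-loop rubric could be captured as a per-reviewer score record. The 1-5 scales, field names, and acceptance threshold below are illustrative assumptions, not the project's actual rubric.

```python
from dataclasses import dataclass

@dataclass
class RubricScore:
    """One reviewer's rubric scores for a single model output."""
    factual_accuracy: int   # 1-5: do the claims agree with the source material?
    logical_soundness: int  # 1-5: does each step follow from the previous ones?
    hallucinated: bool      # is any unsupported claim present?

def passes(score: RubricScore, threshold: int = 4) -> bool:
    """Hypothetical acceptance rule: strong on both scales and hallucination-free."""
    return (
        score.factual_accuracy >= threshold
        and score.logical_soundness >= threshold
        and not score.hallucinated
    )

# Usage: a strong, hallucination-free output passes.
print(passes(RubricScore(factual_accuracy=5, logical_soundness=4, hallucinated=False)))
```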