LLM Output Comparative Evaluation — AI Model Assessor
I conducted structured, side-by-side comparisons of LLM outputs using diverse prompt designs to evaluate scientific reasoning depth, factual accuracy, and summary quality. The workflow compared outputs from Claude, ChatGPT, and Gemini on identical research tasks in genomics and diagnostics, and I documented specific improvements and discrepancies produced by different prompting strategies such as chain-of-thought and zero-shot prompting (a minimal sketch of such a harness follows the list below).

• Ran comparative LLM evaluations on genomics and clinical lab data tasks.
• Measured improvements in structure, depth, and specificity using annotated prompt experiments.
• Provided quantitative and qualitative feedback on LLM performance in scientific and medical domains.
• Generated detailed reports for multi-model assessment of scientific output quality.
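To make the comparison workflow concrete, here is a minimal Python sketch of a side-by-side evaluation harness. It is illustrative only: the model functions are hypothetical stubs standing in for vendor SDK calls, and the strategy templates, task text, and all names (run_side_by_side, Comparison, etc.) are assumptions, not the exact tooling used.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-ins for real model API wrappers (Anthropic, OpenAI,
# Google SDKs in practice); stubbed here so the harness runs standalone.
def claude_stub(prompt: str) -> str:
    return f"[Claude output for: {prompt[:40]}...]"

def chatgpt_stub(prompt: str) -> str:
    return f"[ChatGPT output for: {prompt[:40]}...]"

def gemini_stub(prompt: str) -> str:
    return f"[Gemini output for: {prompt[:40]}...]"

MODELS: Dict[str, Callable[[str], str]] = {
    "Claude": claude_stub,
    "ChatGPT": chatgpt_stub,
    "Gemini": gemini_stub,
}

# Two prompting strategies applied to the same underlying task.
STRATEGIES: Dict[str, str] = {
    "zero-shot": "{task}",
    "chain-of-thought": "{task}\n\nThink step by step before answering.",
}

@dataclass
class Comparison:
    task: str
    strategy: str
    model: str
    output: str

def run_side_by_side(tasks: List[str]) -> List[Comparison]:
    """Run every task through every strategy x model combination."""
    results: List[Comparison] = []
    for task in tasks:
        for strat_name, template in STRATEGIES.items():
            prompt = template.format(task=task)
            for model_name, model_fn in MODELS.items():
                results.append(
                    Comparison(task, strat_name, model_name, model_fn(prompt))
                )
    return results

if __name__ == "__main__":
    # Example genomics task; real evaluations would iterate over a task set
    # and attach rubric scores (accuracy, depth, structure) to each row.
    rows = run_side_by_side(
        ["Summarize the clinical significance of a BRCA1 variant."]
    )
    for r in rows:
        print(f"{r.model:8s} | {r.strategy:16s} | {r.output}")
```

Structuring the results as flat (task, strategy, model, output) rows makes it straightforward to annotate each output against a rubric and tabulate strategy-versus-model differences for the multi-model reports.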