LLM Failure Analyst
I evaluated outputs of Qwen-class language models on adversarially designed doctoral-level physics problems, designed benchmarks to assess model reasoning in complex scientific contexts, and built evaluation pipelines focused on robustness and accuracy.
• Crafted and administered adversarial evaluation sets for LLMs.
• Identified recurring failure points in LLM physics reasoning.
• Implemented standardized evaluation rubrics for scoring consistency.
• Benchmarked LLM outputs to pinpoint areas for improvement.
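The rubric-based scoring described above can be sketched as follows. This is a minimal illustration only: the criteria, weights, and function names are hypothetical and do not reflect the actual rubric or pipeline.

```python
# Hypothetical sketch of a weighted-rubric scoring pass for LLM answers.
# All criteria and weights below are illustrative, not the real rubric.
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    weight: float  # relative importance of this rubric item


# Example rubric for grading a physics answer (illustrative values)
RUBRIC = [
    Criterion("correct final answer", 0.5),
    Criterion("valid derivation steps", 0.3),
    Criterion("consistent units", 0.2),
]


def score_response(checks: dict) -> float:
    """Weighted rubric score in [0, 1]; `checks` maps criterion name to pass/fail."""
    return sum(c.weight for c in RUBRIC if checks.get(c.name, False))


# Usage: a response with a correct answer and units but a flawed derivation
result = score_response({
    "correct final answer": True,
    "valid derivation steps": False,
    "consistent units": True,
})
```

Keeping each criterion as explicit pass/fail data, rather than a single holistic grade, is what makes scores comparable across graders and across model versions.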