Prompt Engineer / Python AI Evaluator
As a Prompt Engineer / Python AI Evaluator at Scale AI, I conducted evaluations to benchmark the performance of large language models (LLMs) on reasoning, coding, and language tasks. My responsibilities included ranking AI model responses, writing high-quality rationales, and creating structured annotations for model outputs. I also led SFT dataset creation, ensuring quality and consistency across the data.

• Designed and implemented prompt engineering workflows for data evaluation.
• Performed preference ranking of model outputs for RLHF (see the sketch after this list).
• Curated datasets for supervised fine-tuning (SFT) and evaluation.
• Wrote metacognitive rationales to improve model reasoning and performance.
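By way of illustration, below is a minimal Python sketch of the kind of pairwise preference record that output-ranking work for RLHF produces. The field names, helper function, and example content are hypothetical, not the actual Scale AI tooling or schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class PreferenceAnnotation:
    """One pairwise comparison record for RLHF-style preference data (illustrative schema)."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str   # "a" or "b"
    rationale: str   # written justification for the ranking


def to_jsonl(annotations: list[PreferenceAnnotation]) -> str:
    """Serialize annotation records to JSONL, one JSON object per line."""
    return "\n".join(json.dumps(asdict(a)) for a in annotations)


if __name__ == "__main__":
    # Hypothetical example of a single ranked pair with its rationale.
    record = PreferenceAnnotation(
        prompt="Explain why the sky is blue.",
        response_a="Rayleigh scattering favors shorter wavelengths of sunlight...",
        response_b="Because the sky reflects the ocean.",
        preferred="a",
        rationale="Response A gives the correct physical mechanism; B repeats a common misconception.",
    )
    print(to_jsonl([record]))
```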