Supervised Fine-Tuning (SFT) Evaluation of Multi-Model LLM Code Generation Pipelines
- Designed and evaluated a multi-agent LLM development system to improve model performance on full-stack engineering tasks.
- Curated high-quality supervised examples (gold-standard outputs) and used them to guide SFT-style evaluation workflows.
- Generated and compared outputs from open-weight models (Qwen, Kimi, GLM, Gemma) and proprietary models (Claude, GPT, Gemini), scoring each on correctness, architectural integrity, type safety, and cross-service integration.
- Ranked, corrected, and iteratively refined outputs, identifying recurring failure patterns such as broken async flows, missing edge-case handling, and invalid API contracts.
- Developed reusable prompting frameworks and evaluation rubrics that improved output consistency and quality while reducing token usage.
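
The ranking step described above can be sketched as a rubric-weighted scorer. This is a minimal illustration only: the dimension weights, score scale, and model names below are hypothetical, not the actual rubric used in the project.

```python
from dataclasses import dataclass

# Hypothetical rubric weights over the four evaluation dimensions
# named in the description; the real weights are not specified.
RUBRIC = {
    "correctness": 0.40,
    "architectural_integrity": 0.25,
    "type_safety": 0.20,
    "integration": 0.15,
}

@dataclass
class CandidateOutput:
    model: str
    scores: dict  # dimension -> score on an assumed 0-5 scale

def weighted_score(candidate: CandidateOutput) -> float:
    """Collapse per-dimension rubric scores into one weighted score."""
    return sum(RUBRIC[dim] * candidate.scores.get(dim, 0.0) for dim in RUBRIC)

def rank(candidates: list[CandidateOutput]) -> list[CandidateOutput]:
    """Order candidate outputs best-first by weighted rubric score."""
    return sorted(candidates, key=weighted_score, reverse=True)

# Illustrative scores for two anonymous candidates.
candidates = [
    CandidateOutput("model_a", {"correctness": 4, "architectural_integrity": 3,
                                "type_safety": 5, "integration": 2}),
    CandidateOutput("model_b", {"correctness": 5, "architectural_integrity": 4,
                                "type_safety": 3, "integration": 4}),
]
print([c.model for c in rank(candidates)])  # best-first ranking
```

Weighting correctness most heavily reflects the ordering of criteria in the description; in practice the weights would be tuned against the gold-standard examples.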