Agentic LLM Evaluation for App Generation
I evaluated LLMs on multi-step application generation tasks, in which models built full-stack applications through iterative prompting. Guiding the model across multiple turns, I tested how well it could build and improve an app over time. I reviewed the generated code for correctness, structure, and real-world usability, created test cases to validate functionality, and identified issues in logic, design, and implementation.