MCP Tool Fine-Tuning
This project aims to improve LLM performance on MCP tool-calling tasks using Reinforcement Learning with Verifiable Rewards (RLVR). It introduces a rubric-based reward system that provides detailed, multidimensional feedback for complex, multi-step reasoning.

In this project, you will write a prompt that requires one or more tool calls to fulfill. You will then observe the trajectory the model follows to generate its response. Your goal is to rewrite the prompt until the model produces an incorrect response. Upon model failure, you will create a rubric that defines not only what an ideal response must contain but also the ideal trajectory the model must follow to reach it. Your work will enhance the ability of cutting-edge LLMs to provide fitting and sophisticated answers to a diverse set of user prompts.
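A rubric of this kind can be sketched as a weighted checklist scored against both the final response and the tool-calling trajectory. This is a minimal illustration, not the project's actual scoring code; the criterion texts, weights, and the `get_weather` tool name are hypothetical examples.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    """One verifiable check on the model's response or trajectory."""
    description: str
    weight: float
    passed: bool = False

def rubric_reward(criteria: list[RubricCriterion]) -> float:
    """Weighted fraction of satisfied criteria, in [0, 1]."""
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if c.passed)
    return earned / total if total else 0.0

# Hypothetical rubric for a prompt that requires a weather tool call.
criteria = [
    RubricCriterion("Calls the correct tool (get_weather)", weight=2.0, passed=True),
    RubricCriterion("Passes the city from the prompt as an argument", weight=1.0, passed=True),
    RubricCriterion("Final answer cites the tool output rather than guessing", weight=2.0, passed=False),
]
print(rubric_reward(criteria))  # 3.0 / 5.0 = 0.6
```

Because each criterion is verified independently, the reward is multidimensional rather than a single pass/fail signal, which gives the RLVR training loop finer-grained feedback on partially correct trajectories.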