AI Agent Evaluation Analyst
I work on Mindrift/Toloka projects such as Tendem, evaluating AI agents and LLM tools on complex, multi‑step client tasks. The work includes coding and debugging in a remote virtual environment, building and testing function‑calling and API workflows, running web searches and business data analysis, and comparing LLM outputs for quality and correctness. I create and refine prompts and responses, classify and rate agent behavior, perform online research and document analysis, and pick up partially completed tasks from other evaluators to finish or correct them. I follow detailed task‑approach, rejection, and handoff guidelines to decide when to request more information, reject a task, or hand it off, and I document my methods and results clearly for clients.