Yug Pratap Gupta

AI Model Evaluator & RLHF Specialist | LLM Prompt Engineer | AI Automation Engineer

Gurgaon, India
$30.00/hr | Intermediate | Data Annotation Tech | Google Cloud Vertex AI | Mercor

Key Skills

Software

Data Annotation Tech
Google Cloud Vertex AI
Mercor
Mindrift
Telus

Top Subject Matter

Artificial Intelligence & Machine Learning (LLM Evaluation, RLHF, Prompt Engineering, AI Training Data)
Data Science & Analytics (Business Intelligence, Forecasting, ETL Pipelines, Statistical Modeling)
Customer Support Automation
Software Engineering & AI Automation (Agentic Systems, API Integration, Voice AI, Workflow Automation)

Top Data Types

Audio
Image
Computer Code Programming

Top Task Types

Transcription
Classification
RLHF
Computer Programming/Coding
Prompt + Response Writing (SFT)

Freelancer Overview

AI Automation Engineer with 3+ years of experience building production-grade LLM-powered systems, agentic workflows, and AI evaluation pipelines. Currently contributing to RLHF training pipelines through Soul AI (Deccan AI Experts) and Outlier (Scale AI), applying structured 15-dimension evaluation frameworks across instruction following, truthfulness, code correctness, harmlessness, and comparative preference ranking. Experienced in writing evidence-based justifications that directly shape reward model training signals for large language models.

Beyond annotation, I bring hands-on expertise in GPT-4o, Gemini APIs, RAG systems, prompt engineering, and end-to-end AI workflow automation across fintech and pharma domains. My work spans both the engineering side, building agentic AI systems like ARIA (a multilingual voice agent), and the evaluation side, which lets me assess AI outputs with both technical depth and quality rigor.

Education: B.Sc. Computer Science Engineering (Big Data Analytics), Chandigarh University, 2024.

Intermediate | English | Hindi | German

Labeling Experience

Basic Tasking & Visual Instruction Following — Aether

Image | Classification
Completed short-form AI training microtasks including image description, visual difference spotting, and instruction-following evaluations. Tasks were designed to train models on basic perceptual and comprehension capabilities through high-volume, high-accuracy labeling. Applied coding and SQL-adjacent reasoning skills to structured classification and comparison tasks.

2026 - Present

Research-Based Prompt Engineering for Advanced Model Benchmarking — Almanac

Text | Prompt + Response Writing (SFT)
Contributed to Almanac, an Outlier (Scale AI) project focused on creating research-based prompts with verifiable, factually grounded answers designed to challenge and stress-test advanced frontier AI models. Core tasks included:
• Prompt Engineering & Research: Crafted complex, multi-step prompts across data science and analytical domains requiring deep reasoning, factual accuracy, and source-verifiable responses. Prompts were designed to expose reasoning gaps and capability limits in state-of-the-art LLMs, similar in spirit to adversarial benchmarking frameworks like Humanity's Last Exam (HLE).
• Response Evaluation: Assessed AI-generated responses against multiple quality dimensions including Completeness (how well the response followed instructions), Groundedness (accuracy based on provided context/documents), Truthfulness (accuracy against external verifiable sources), and Overall Response Quality, applying the same structured evaluation methodology used in RLHF pipelines from prior Soul AI / Deccan AI work.
• Comparative Ranking: Performed side-by-side comparison of AI response pairs for the same prompt, determining which response better satisfied user intent based on accuracy, reasoning depth, and alignment, and wrote evidence-based justifications for each preference decision.
• Benchmark Contribution: Work directly fed into training data pipelines used to measure and improve frontier model performance on high-difficulty, domain-specific knowledge tasks, contributing to the broader effort of making LLMs more accurate, reliable, and resistant to hallucination in specialized domains.

2026 - Present

Code Quality & Tool Use Evaluation — Outlier

Computer Code Programming | Evaluation Rating
Assessed AI-generated code responses for correctness of tool selection, function parameters, multi-tool sequencing logic, and final output accuracy. Evaluated code and output independently — identifying hallucinated results vs. genuine execution errors. Rated responses across Tool Use, Code Sequencing, and Code Output dimensions. Tasks included Python-based generation, API tool calls, and multi-step agentic code scenarios.

2026 - Present

AI Voice Agent System Developer (ARIA Project)

Audio | Transcription
I designed, developed, and deployed a multilingual AI voice support agent leveraging advanced transcription and intent recognition. The system labeled and processed audio conversations into text, then performed semantic search and LLM-based processing to classify intent and route inquiries. Automated workflows used labeled data for zero-touch tier-1 support intake and performance improvement through continuous feedback.
• Labeled and transcribed multilingual voice data using ElevenLabs and Web Speech API
• Utilized GPT-4o-mini for processing, classification, and intent identification from text
• Managed semantic search and RAG using ChromaDB for documentation retrieval
• Automated data labeling and routing pipeline using n8n and Python Flask

2025 - Present

LLM Response Evaluation & RLHF — Soul AI / Deccan AI Experts

Text | RLHF
Evaluated AI-generated response pairs across a structured 15-dimension rubric covering Instruction Following, Truthfulness, Harmlessness, Content Completeness, Relevance, Collaborativity, and Writing Style. Applied comparative preference ranking to determine which responses better satisfied user intent, then wrote evidence-based justifications to generate reward signals for LLM fine-tuning. Tasks spanned text, code, and multi-turn conversation scenarios across varying complexity levels.

2025

Education

Chandigarh University

Bachelor of Science, Computer Science Engineering with Specialization in Big Data Analytics

2020 - 2024

Work History

Satin Creditcare Network Limited

Assistant Manager – Business Intelligence Unit

Gurgaon
2025 - Present

Nectar Lifesciences Ltd.

Data Analyst

Chandigarh
2024 - 2025