Code Generation and Evaluation for LLM Training
I worked on training and evaluating LLMs for programming across seven task types (e.g., code generation, code review, debugging, test case generation, documentation, and refactoring) and many categories, including front-end, back-end, data engineering, data visualization, database management, scripting, and algorithms. Responsibilities included:

- Generating prompts tailored to specific task types and categories (e.g., for code generation, "build me X"; for debugging, "here's my code, I'm getting this error, please fix it").
- Performing domain classification to validate that prewritten prompts aligned with their target use cases.
- Evaluating model responses on dimensions such as correctness, clarity, efficiency, readability, and adherence to specifications, with criteria varying by category (e.g., database management also covered query optimization).
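The category-dependent rubric described above can be sketched as follows. This is a minimal, hypothetical illustration: the dimension names, category keys, and 1-5 scale are assumptions for the example, not the actual rubric used.

```python
# Hypothetical sketch of category-aware rubric scoring.
# All names and the 1-5 scale are illustrative assumptions.

BASE_DIMENSIONS = ["correctness", "clarity", "efficiency",
                   "readability", "spec_adherence"]

# Extra criteria some categories add on top of the base rubric,
# e.g. query optimization for database management work.
CATEGORY_EXTRAS = {
    "db_management": ["query_optimization"],
}

def rubric_for(category: str) -> list[str]:
    """Return the full list of evaluation dimensions for a category."""
    return BASE_DIMENSIONS + CATEGORY_EXTRAS.get(category, [])

def score_response(ratings: dict[str, int], category: str) -> float:
    """Average the 1-5 ratings across the category's rubric.

    Raises if any required dimension is unrated, so incomplete
    evaluations are caught rather than silently averaged.
    """
    dims = rubric_for(category)
    missing = [d for d in dims if d not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    return sum(ratings[d] for d in dims) / len(dims)
```

For example, a database-management response would be scored on six dimensions (the five base ones plus `query_optimization`), while a front-end response would use only the base five.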