
Muhammad Sulleman

Founder & Lead Engineer — Data Labeling, QA & LLM Output Evaluation, MetaOrcha

Islamabad, Pakistan
$30.00/hr · Expert · AWS SageMaker · HiveMind · Mercor

Key Skills

Software

AWS SageMaker
HiveMind
Mercor
Remotasks

Top Subject Matter

LLM agentic orchestration
behavioral testing
output evaluation

Top Data Types

Text
Image
Document

Top Task Types

Classification
Question Answering
Text Summarization
RLHF
Fine-tuning
Red Teaming
Transcription
Evaluation/Rating
Computer Programming/Coding
Function Calling
Prompt + Response Writing (SFT)
Text Generation

Freelancer Overview

Founder & Lead Engineer — Data Labeling, QA & LLM Output Evaluation, MetaOrcha. Brings 8+ years of experience spanning complex professional workflows, research, and quality-focused execution. Core strengths include internal and proprietary tooling. Education: Bachelor of Engineering, National University of Sciences and Technology (2022). AI-training focus includes data types such as Text and labeling workflows including Evaluation, Rating, and Classification.

Expert: English, Russian, Urdu

Labeling Experience

Founder & Lead Engineer — Data Labeling, QA & LLM Output Evaluation, MetaOrcha

Text
Led systematic AI failure-mode analysis and behavioral validation of agent outputs on a production-grade LLM orchestration platform. Developed and enforced labeling guidelines, correctness criteria, and ambiguity escalation paths for output review. Benchmarked functional equivalence across LLM providers and stress-tested third-party agent network outputs for quality.
• Defined annotation-style acceptance criteria for agentic outputs.
• Crafted adversarial test suites targeting model behaviors and failure cases.
• Built processes for edge-case enumeration, inter-annotator agreement, and output ambiguity handling.
• Used internal/proprietary tooling with evaluation rubrics for ongoing quality assurance.

2025 - Present

Co-founder & CTO — Annotation-Centric QA & Expected-vs-Actual Output Labeling, Crashx

Text
Created and implemented annotation-style acceptance criteria for education-tech platform features, defining expected, passing, and regression outputs. Generated labeled logs of expected-vs-actual outputs for student responses under adversarial exam conditions, ensuring reproducibility and ongoing QA. Communicated annotation guidelines clearly and scaled them to a 120+ student user base.
• Designed reproducible annotation guides for student-facing feature QA.
• Produced labeled datasets capturing edge cases in educational assessment.
• Established criteria for escalating ambiguous labeling outcomes.
• Leveraged internal/proprietary tools for log labeling and QA review.

2024 - Present

Engineer — Ground-Truth Label Definition & Decision Guideline Creation, WebPuls.ai

Text · Classification
Defined and documented ground-truth labels distinguishing meaningful from spurious change in web page content for production content-detection logic. Tuned label guidelines for precision/recall trade-offs and handled edge-case decision criteria at scale. Developed quality-gating and noise-filtering measures mirroring data labeling best practices.
• Created robust labeling rubrics for change classification in noisy data.
• Scaled annotation processes to thousands of web pages.
• Documented and updated decision guidelines for ambiguous content shifts.
• Used internal/proprietary detection logic and evaluation tools.

2024 - 2024

Intern — Validation Test Annotation & Edge-Case Labeling, PQC Labs

Text
Developed and executed validation tests to ensure protocol correctness, entropy preservation, and adversarial recovery in cryptographic data scenarios. Provided systematic edge-case coverage directly applicable to AI training data validation. Performed detailed error-scenario labeling, ambiguity resolution, and comprehensive review of protocol outcomes.
• Validated and labeled protocol outputs in adversarial settings.
• Created annotation criteria for entropy and recovery benchmarks.
• Conducted edge-case enumeration and protocol ambiguity handling.
• Relied on internal/proprietary test and validation frameworks.

2023 - 2023

Education


National University of Sciences and Technology

Bachelor of Engineering, Software Engineering

2022

Work History


MetaOrcha

Founder & Lead Engineer

Islamabad
2025 - Present

Crashx

Co-founder & CTO

Islamabad
2024 - Present