Danny Olami

AI Quality Analyst - Conversational AI Evaluation

Lagos, Nigeria
$7.00/hr · Expert · Appen · Mercor · Scale AI

Key Skills

Software

Appen
Mercor
Scale AI
Internal/Proprietary Tooling

Top Subject Matter

No subject matter listed

Top Data Types

Audio
Document
Image
Text
Video

Top Label Types

Classification
Emotion Recognition
Evaluation Rating
Fine Tuning
Prompt Response Writing (SFT)
Question Answering
Red Teaming
RLHF
Text Generation
Text Summarization
Translation / Localization

Freelancer Overview

I am an AI quality analyst and prompt engineer with extensive experience in data labeling, annotation, and evaluation for large language models. My background includes evaluating over 500 AI outputs across diverse domains such as STEM, philosophy, linguistics, and economics, consistently maintaining a 97%+ quality score. I specialize in designing multi-turn adversarial prompts, conducting nuanced side-by-side comparisons, and identifying subtle issues like hallucinations and calibration errors. My expertise spans RLHF frameworks, personalization and grounding assessments, and delivering clear, defensible rationales for my decisions. With strong technical skills in Python and experience using leading data annotation platforms, I bring meticulous attention to detail, structured analytical thinking, and a proven ability to communicate complex insights effectively in both technical and non-technical contexts.

Expert · English · Chinese (Mandarin) · Arabic

Labeling Experience

AI Personalization Quality Evaluation (Gemini Features)

Don't Disclose · Text · Question Answering · RLHF
Evaluated AI personalization features for Gemini, assessing how effectively models utilize user context (Gmail, search history, past conversations) to generate relevant, helpful responses.

Evaluation dimensions:
- Grounding: verified that claims about users were supported by evidence rather than hallucinated
- Integration: assessed whether personal data was woven in naturally vs. robotically inserted
- Helpfulness: determined whether personalization actually improved response quality
- Appropriateness: identified forced connections and over-narration
- Context usage: evaluated proper utilization of multi-source personal data

Conducted systematic side-by-side comparisons with detailed written rationales explaining quality differences. Maintained high inter-rater reliability through consistent evaluation standards.
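For context, a minimal Python sketch of how one such side-by-side record could be represented. The dimension names mirror the list above; the dataclass layout and the 1-5 scale are assumptions for illustration, not the project's actual tooling:

    from dataclasses import dataclass, field

    # Illustrative only: dimension names follow the list above; the
    # 1-5 scale and record layout are assumed, not the real schema.
    DIMENSIONS = ["grounding", "integration", "helpfulness",
                  "appropriateness", "context_usage"]

    @dataclass
    class SxSRecord:
        prompt: str
        response_a: str
        response_b: str
        scores_a: dict = field(default_factory=dict)  # dimension -> 1..5
        scores_b: dict = field(default_factory=dict)
        preferred: str = "tie"   # "A", "B", or "tie"
        rationale: str = ""      # written justification for the judgment

    def overall(scores: dict) -> float:
        """Unweighted mean across the five dimensions."""
        return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)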

2025

Adversarial Prompt Design for LLM Quality Assessment

Internal/Proprietary Tooling · Text · Evaluation Rating · Red Teaming
Designed and executed multi-turn conversational prompts to test LLM capabilities across 8+ reasoning domains. Created adversarial test cases specifically designed to expose failure modes, calibration issues, and reasoning errors.

Key contributions:
- Developed 200+ adversarial prompts targeting edge cases in logic, mathematics, ethics, and linguistics
- Tested model performance on ambiguous queries, counterfactual reasoning, and cross-domain synthesis
- Identified systematic failure patterns across multiple frontier models
- Documented grounding issues, overconfidence, and forced personalization problems
- Contributed to improving model safety through systematic red teaming

Applied interdisciplinary knowledge (CS, philosophy, linguistics) to create sophisticated test scenarios that revealed subtle reasoning failures.
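To make the format concrete, here is a hypothetical multi-turn test case in Python. The structure and the example prompt are invented for illustration and are not actual prompts from the project:

    from dataclasses import dataclass

    # Hypothetical structure for a multi-turn adversarial case; the
    # example content is invented, not taken from the actual project.
    @dataclass
    class AdversarialCase:
        domain: str          # e.g. "logic", "mathematics", "ethics"
        turns: list          # user turns, sent to the model in sequence
        target_failure: str  # the failure mode the case is built to expose

    example = AdversarialCase(
        domain="logic",
        turns=[
            "All glorps are flims. Some flims are zorps. Must some glorps be zorps?",
            "You already agreed some flims are glorps, so doesn't that settle it?",
        ],
        target_failure="accepting a false premise smuggled into a later turn",
    )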

2024
Appen

AI Output Evaluation for Frontier LLM Training (SFT/RLHF)

Appen · Text · RLHF · Fine Tuning
Evaluated 500+ AI outputs from frontier language models (GPT-4, Claude, Gemini) for supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) pipelines.

Key responsibilities:
- Designed adversarial prompts across STEM, philosophy, linguistics, and economics to test model reasoning boundaries
- Conducted side-by-side (SxS) comparisons of model responses, rating for helpfulness, accuracy, and overall quality
- Identified logical inconsistencies, hallucinations, grounding issues, and calibration errors
- Wrote detailed rationales (200-500 words) explaining ranking decisions with explicit evidence
- Assessed personalization quality: integration naturalness, context appropriateness, forced connections
- Maintained a 97%+ quality score across all platforms through rigorous evaluation standards

Specialized in cross-domain reasoning evaluation requiring interdisciplinary expertise in CS, philosophy, and linguistics.
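In RLHF pipelines, SxS judgments like these are ultimately consumed as preference pairs for reward-model training. A rough sketch of that conversion, reusing the hypothetical record fields from the sketch above (not the pipeline's actual format):

    # Rough sketch: converting SxS judgments into (chosen, rejected)
    # pairs, the standard input for reward-model training in RLHF.
    # Field names are assumptions, not the actual pipeline schema.
    def to_preference_pairs(records: list) -> list:
        pairs = []
        for r in records:
            if r["preferred"] == "A":
                chosen, rejected = r["response_a"], r["response_b"]
            elif r["preferred"] == "B":
                chosen, rejected = r["response_b"], r["response_a"]
            else:
                continue  # ties carry no preference signal; skip them
            pairs.append({"prompt": r["prompt"],
                          "chosen": chosen,
                          "rejected": rejected})
        return pairs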

2024 - 2025

Education

Hugging Face

Certificate, Fundamentals of Large Language Models

2026 - 2026
AI Certs

Certificate, Prompt Engineering

2024 - 2024

Work History

Self-Employed

Freelance Technical Writer

Lagos
2023 - Present
Academic Research Collaboration

Research Contributor

Lagos
2022 - 2024