Felipe Lima

Medical Doctor & LLM Safety Evaluator (PT-EN-ES)

Russas, Brazil
$50.00/hr · Expert · Appen · Clickworker · CrowdSource

Key Skills

Software

Appen
Clickworker
CrowdSource
Data Annotation Tech
Lionbridge
OneForma
Remotasks
Scale AI

Top Subject Matter

No subject matter listed

Top Data Types

Document
Medical Dicom
Text

Top Task Types

Classification
Evaluation Rating
Prompt Response Writing SFT
Text Summarization
Translation Localization

Freelancer Overview

Medical Doctor (M.D.) with strong clinical experience in Primary Care, Emergency Medicine, and telemedicine, combined with over four years of hands-on work in AI training, data labeling, and LLM evaluation. I specialize in safety alignment, clinical reasoning assessment, guideline-based fact-checking, hallucination detection, and multilingual text analysis (PT–EN–ES). Across platforms such as Outlier.ai, Remotasks/Scale AI, OneForma/Pactera EDGE and Clickworker, I have contributed to high-accuracy datasets for medical content evaluation, reasoning benchmarks, and general NLP tasks. My background as a physician allows me to review complex scenarios with precision, identify risks, and ensure safe and evidence-based outputs. I bring excellent analytical skills, clear communication, and the ability to work across diverse projects requiring detail, consistency, and sound judgment.

Expert · English · Spanish · Portuguese

Labeling Experience

Scale AI

Clinical Reasoning Benchmarking for Medical AI Models

Scale AI · Medical Dicom · Question Answering · Text Summarization
Conducted advanced medical evaluation of AI-generated clinical reasoning, diagnoses, differential diagnosis ranking, red-flag detection, and treatment recommendations. Reviewed complex multi-step medical cases to ensure accuracy, safety, guideline adherence, and absence of hallucinations. Assessed patient triage quality, emergency red flag recognition, and evidence-based therapeutic reasoning. Contributed to high-level medical benchmarking datasets used for training and validating medical AI systems across primary care, emergency medicine, pediatrics, mental health, gastroenterology, and internal medicine.

2024
Scale AI

LLM Safety & Medical Content Evaluation

Scale AI · Text · Question Answering · Text Summarization
Evaluated large language model outputs for medical accuracy, diagnostic quality, safety risks, treatment appropriateness, and clinical reasoning depth. Performed rating, correction, and review of AI-generated answers related to primary care, emergency medicine, mental health, pediatrics, and internal medicine. Ensured guideline-aligned and safe medical reasoning.

2023
Scale AI

General LLM Evaluation & Text Categorization

Scale AI · Text · Classification · Question Answering
Worked on multiple LLM evaluation tasks, including classification, summarization checks, relevance scoring, bias detection, text quality rating, and multi-step reasoning audits. Labeled and evaluated model outputs to improve safety, coherence, and factual accuracy across general content tasks.

2021 - 2024
OneForma

Multilingual NLP Review & Linguistic Evaluation (PT-EN-ES)

OneForma · Text · Question Answering · Text Summarization
Performed multilingual evaluation and linguistic quality review in Portuguese, English, and Spanish. Included translation validation, semantic consistency checks, grammar review, cultural adaptation, and quality rating of multilingual datasets used to train NLP systems.

2020 - 2024
Clickworker

Search Evaluation & Content Classification

Clickworker · Text · Classification · Question Answering
Reviewed search results for relevance, correctness, and intent alignment. Performed content classification, metadata tagging, quality rating, and analysis of large text sets to improve search algorithm performance.

2021 - 2022

Education

Universidade Federal do Vale do São Francisco

Doctor of Medicine, Medicine

2019 - 2025

Work History

UBS

Medical Doctor

Ceará
2025 - Present

Núcleo Pré-vestibulares

Science Teacher

Petrolina
2018 - 2019