Martha Gerges


Senior Project Manager - Medicine and Medical Research

Sydney, Australia
$60.00/hr · Expert · Data Annotation Tech · Google Cloud Vertex AI · Label Studio

Key Skills

Software

Data Annotation Tech
Google Cloud Vertex AI
Label Studio
Mercor
Remotasks
Scale AI
Telus

Top Subject Matter

Medical Research and Medicine
Regulatory Compliance & Risk Analysis
Chemistry, Biology and Psychology

Top Data Types

Image
Text
Document

Top Task Types

Object Detection
Text Generation
Question Answering
Text Summarization
RLHF
Fine Tuning
Transcription
Evaluation Rating
Data Collection
Prompt Response Writing SFT
Classification

Freelancer Overview

I have extensive experience designing and evaluating training data for large language models, particularly in specialised domains such as healthcare, medicine, and clinical research. My work focuses on advanced prompt engineering, structured rubric development, and rigorous evaluation of model outputs. I have contributed to complex AI training initiatives, including the Lotus and Bullseye Prompting frameworks, where I design high-precision prompts, develop scoring rubrics, and assess model responses against strict criteria such as atomicity, mutual exclusivity, reasoning quality, and factual accuracy. A large part of my role involves identifying reasoning errors, comparing alternative responses, and refining prompts and evaluation frameworks to systematically improve model performance.

What sets my work apart is the integration of deep subject-matter expertise with AI training methodology. Drawing on my professional background in clinical trials operations, healthcare systems, and research governance, I develop training tasks that test real-world reasoning rather than generic knowledge. I design prompts and evaluation frameworks across medical domains such as geriatrics, dermatology, nutrition, and clinical research, ensuring models are evaluated against realistic professional scenarios. My approach emphasises structured thinking, clear and measurable evaluation criteria, and careful auditing of model outputs for logical consistency, compliance with project guidelines, and evidence-based reasoning. This combination allows me to produce high-quality training data that strengthens the reliability and practical capability of large language models.

Languages: Arabic · English (Expert)

Labeling Experience

AI Trainer

Text · Text Generation
I generate structured AI training text designed to test how well large language models understand instructions, reason through complex problems, and produce accurate, reliable responses. Much of the content I create consists of carefully engineered prompts that simulate real-world tasks, along with detailed scoring rubrics that define how model outputs should be evaluated. These prompts often require multi-step reasoning, interpretation of specialised information, or adherence to strict formatting and instruction constraints. I also write high-quality reference responses and comparison examples that allow evaluators to assess differences in reasoning quality, factual accuracy, and instruction compliance.

A large portion of the text I produce is designed for evaluation and benchmarking. This includes writing prompts that expose common model failure points such as hallucinations, weak reasoning chains, instruction violations, or inconsistencies in logic. I develop clear evaluation criteria that allow responses to be scored objectively, often using atomic and mutually exclusive rules to ensure consistency across assessments.

The text I generate frequently incorporates specialised subject matter such as medical knowledge, clinical research processes, nutrition science, and professional decision-making scenarios. The goal is to create training and evaluation data that strengthens a model’s ability to perform reliably in complex, real-world contexts.


2023 - Present

AI Trainer

Medical DICOM · Prompt Response Writing SFT
The projects I work on focus on building and evaluating high-quality training data that improves how large language models reason, follow instructions, and perform in specialised domains. Much of my work involves designing complex prompts that test a model’s ability to interpret instructions, apply structured reasoning, and produce responses that meet strict evaluation criteria. I develop detailed rubrics that define exactly what constitutes a strong or weak response, then assess model outputs against those criteria. These projects often involve comparing multiple responses, identifying reasoning failures, detecting hallucinations or factual inaccuracies, and refining prompts to ensure the model is tested in a precise and measurable way.

Many of the projects I contribute to are designed to simulate real-world professional tasks rather than simple question-and-answer scenarios. For example, I build prompts that require models to analyse medical information, interpret clinical scenarios, or apply domain-specific knowledge in fields such as geriatrics, dermatology, nutrition, and clinical research. These tasks require structured thinking, evidence-based reasoning, and strict adherence to detailed guidelines such as atomic criteria, mutually exclusive scoring rules, and consistent evaluation standards. The goal is to create training datasets that push models to demonstrate reliable reasoning, accuracy, and instruction following under complex conditions that reflect real professional environments.
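To make the "atomic, mutually exclusive" rubric idea concrete, here is a minimal sketch of that style of scoring. All names, criteria, and the sample response are hypothetical illustrations, not drawn from any actual project:

```python
# Sketch of rubric-based response scoring: each criterion is atomic
# (it checks exactly one property) and criteria do not overlap, so a
# response can be scored objectively and reproducibly.
# Everything below is illustrative, not from a real project rubric.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    check: Callable[[str], bool]  # True if the response satisfies it
    weight: int = 1


def score(response: str, rubric: list[Criterion]) -> float:
    """Return the weighted fraction of rubric criteria satisfied."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(response))
    return earned / total


# Example rubric: three independent, single-property checks.
rubric = [
    Criterion("cites_dose", lambda r: "mg" in r),
    Criterion("notes_contraindication", lambda r: "contraindicated" in r.lower()),
    Criterion("under_100_words", lambda r: len(r.split()) < 100),
]

resp = "Metformin 500 mg twice daily; contraindicated in severe renal impairment."
print(round(score(resp, rubric), 2))  # 1.0 — all three atomic checks pass
```

Because each criterion is binary and independent, two evaluators applying the same rubric reach the same score, which is the consistency property the projects above aim for.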


2023 - Present

Education


Macquarie University

Bachelor of Psychology, Psychology


University of New South Wales

Master of Science, Medicine


Work History


The George Institute for Global Health

Senior Project Manager

Sydney
2024 - Present

The George Institute for Global Health

Project Manager

Sydney
2024 - 2024