Matthew Xu

LLM Evaluation and Text Generation Specialist in English & Chinese

Shenzhen, China
$25.00/hr · Intermediate · Data Annotation Tech

Key Skills

Software

Data Annotation Tech

Top Subject Matter

No subject matter listed

Top Data Types

Computer Code Programming
Image
Text

Top Task Types

Audio Recording
Classification
Evaluation Rating
Prompt Response Writing (SFT)
Translation / Localization

Freelancer Overview

I have 1+ years of experience on DataAnnotation.tech, specializing in bilingual (English–Chinese) text analysis and model evaluation. On projects such as Argon and Raven, I labeled and reviewed more than 5,000 samples (including text, images, and audio), achieving an annotation accuracy rate above 98%. My work included evaluating AI outputs, identifying compliance risks, classifying harmful images, and refining model prompts. My bilingual background and experience on large-scale labeling projects enable me to deliver precise, consistent, and scalable annotation results.

Intermediate · English · Chinese (Mandarin)

Labeling Experience

Data Annotation Tech

Adversarial Tester

Data Annotation Tech · Image · Fine Tuning · Evaluation Rating

Argon

On the Argon project, my primary task was to test and evaluate a large language model's (LLM's) ability to identify and respond to harmful content in images. My responsibilities included:

1. Sourcing harmful images: I found images from specified websites that contained harmful material, such as sexually explicit content, violence, political extremism, or other sensitive topics.
2. Prompting the model: I input these images into the model along with a text prompt to see whether it could correctly identify the harmful nature of the image and either refuse to engage or provide a safe, appropriate response.
3. Scoring and editing: I meticulously scored the model's responses on their ability to recognize the harmful content and adhere to safety guidelines. When a response was inadequate, I edited it into an example of the ideal safe and helpful response the model should have given.

2024
Data Annotation Tech

AI Model Evaluation and Annotation

Data Annotation Tech · Text · Text Generation · Fine Tuning

Raven

On the Raven project, my main responsibility was to perform adversarial testing on large language models (LLMs), identifying and exposing the chatbot's weaknesses by creating highly difficult and challenging prompts.

1. Designing high-difficulty prompts: I wrote complex, tricky prompts specifically designed to make the chatbot produce errors or undesirable responses in categories such as localization, instruction following, truthfulness, harmfulness, and writing quality.
2. Creating rubrics: For each test, I manually developed 5-15 detailed evaluation rubrics, the criteria I used to judge the model's performance systematically and objectively, ensuring consistency across my assessments.
3. Editing and scoring: After the models responded, I scored each response separately and edited the best answer.

2024

Education

Sun Yat-sen University

Bachelor of Economics, Economics

2021 - 2025

Work History

Futu Holdings Limited

Community Product Manager Intern

Shenzhen
2024 - 2024
PwC

SaaS Product Manager Intern

Guangzhou
2023 - 2023