Matthew Xu - LLM Evaluation and Text Generation Specialist in English & Chinese

Key Skills

Software

Data Annotation Tech

Top Subject Matter

No subject matter listed

Top Data Types

Computer Code Programming

Image

Text

Top Task Types

Audio Recording

Classification

Evaluation Rating

Prompt Response Writing SFT

Translation Localization

Freelancer Overview

I have 1+ years of experience in DataAnnotation.tech, specializing in bilingual (English–Chinese) text analysis and model evaluation. In projects such as Argon and Raven, I labeled and reviewed over 5,000+ samples( including text, image and audio), achieving an annotation accuracy rate of over 98%. My work included evaluating AI outputs, identifying compliance risks, classifying harmful images, and refining model prompts. My bilingual background and experience in large-scale labeling projects enable me to deliver precise, consistent, and scalable annotation results.

IntermediateEnglishChinese Mandarin

Labeling Experience

Adversarial Tester

Data Annotation TechImageFine TuningEvaluation Rating

Argon On the Argon project, my primary task was to test and evaluate a large language model's (LLM) ability to identify and respond to harmful content in images. My responsibilities included: 1. Sourcing Harmful Images: I would find images from specific websites that contained harmful material, such as sexually explicit content, violence, political extremism, or other sensitive topics. 2. Prompting the Model: I would then input these images into the model along with a text prompt. The goal was to see if the model could correctly identify the harmful nature of the image and refuse to engage with it, or provide a safe and appropriate response. 3. Scoring and Editing: I meticulously scored the model's responses based on its ability to recognize the harmful content and adhere to safety guidelines. If the model's response was inadequate, I would edit it to provide an example of the ideal, safe, and helpful response it should have given.

2024

AI Model Evaluation and Annotation

Data Annotation TechTextText GenerationFine Tuning

Raven Project On the Raven project, my main responsibility was to perform adversarial testing on large language models (LLMs). My goal was to identify and expose the weaknesses of the chatbot by creating highly difficult and challenging prompts. 1. Designing High-Difficulty Prompts: I had to come up with complex and tricky prompts specifically designed to make the chatbot produce errors or undesirable responses. These errors could fall into several categories, such as: * Localization: * Instruction Following * Truthfulness * Harmfulness * Writing Quality 2. Creating Rubrics: For each test, I manually developed 5-15 detailed evaluation rubrics. These were the criteria I used to systematically and objectively judge the model's performance, ensuring consistency in my assessments. 3. Editing and Scoring the Models: After the model provided a response, I would edit the best answer and score them separately.

2024

Education

S

Sun Yat-sen University

Bachelor of Economics, Economics

Bachelor of Economics

2021 - 2025

Work History

F

Futu Holdings Limited

Community Product Manager Intern

Shenzhen

2024 - 2024

P

PwC

SaaS Product Manager Intern

Guangzhou

2023 - 2023