Dwiki Prakasa

Software Engineer - Web Frontend Development and AI Training

Jakarta, Indonesia
$25.00/hr · Intermediate · Mindrift · Toloka

Key Skills

Software

Mindrift
Toloka

Top Subject Matter

No subject matter listed

Top Data Types

Computer Code Programming
Image
Text

Top Label Types

Question Answering
RLHF
Evaluation Rating
Computer Programming Coding
Prompt Response Writing SFT
Text Generation

Freelancer Overview

I am an experienced software engineer and freelance AI trainer with a strong background in data annotation, coding evaluation, and AI training data creation. My work involves designing and delivering complex programming and STEM tasks, developing coding prompts, and executing detailed side-by-side LLM comparisons across quality dimensions such as truthfulness, instruction following, and emotional intelligence. I have hands-on experience with Python, NumPy, and Pandas for technical validation and logical analysis of AI-generated outputs, ensuring accuracy and consistency in data labeling and response assessment. I am skilled at collaborating in distributed reviewer environments, adhering to strict annotation guidelines, and applying rigorous evaluation criteria to improve AI model reliability and performance. My expertise spans software development, prompt engineering, and reinforcement learning from human feedback (RLHF), and I am passionate about advancing high-quality AI systems through precise data annotation and evaluation.

Languages

Indonesian (Intermediate)

Labeling Experience

Mindrift

AI Model Evaluation & Task Generation STEM Domains for Computer Science (Python)

Mindrift · Computer Code Programming · Question Answering · RLHF
Acting as a Subject Matter Expert (SME) to train and evaluate Large Language Models (LLMs) in the domains of Computer Science, Mathematics, Physics, and Python programming. Key responsibilities:
* Task Generation: Creating computationally intensive STEM problems that require multi-step reasoning and Python coding to solve, designed specifically to challenge and identify reasoning failures in AI models.
* Model Evaluation (RLHF): Evaluating AI-generated code for correctness, efficiency, and adherence to constraints; analyzing failure modes such as logic errors, hallucinations, and suboptimal algorithms.
* Golden Solution Creation: Developing deterministic, reproducible, and efficient Python solutions (using libraries like pandas, numpy, and scipy) alongside clear, human-readable explanations to serve as ground truth for model training (a sketch of the idea follows below).
* Vibe Coding / Rapid Prototyping: Executing rapid coding tasks to correct AI responses according to programming standards and clean-code principles.
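To make "golden solution" concrete, here is a minimal sketch pairing a deterministic pandas/numpy computation with a fixed, verifiable test case. The task (a rolling z-score) is a hypothetical example of mine, not one from the actual project; it only illustrates the determinism and ground-truth requirements.

    import numpy as np
    import pandas as pd

    def rolling_zscore(values, window=5):
        # Deterministic by construction: no randomness, a fixed window,
        # and explicit NaN behavior during the warm-up period.
        s = pd.Series(values, dtype="float64")
        mean = s.rolling(window).mean()
        std = s.rolling(window).std(ddof=0)
        return (s - mean) / std

    # A fixed test case serves as the ground truth against which
    # AI-generated solutions can be checked.
    out = rolling_zscore([1, 2, 3, 4, 5, 100], window=5)
    assert np.isnan(out.iloc[3])                  # warm-up: window not yet full
    assert abs(out.iloc[4] - np.sqrt(2)) < 1e-9   # (5 - 3) / sqrt(2)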

2025
Mindrift

Side-by-Side (SxS) LLM Evaluation & Conversational Analysis (Apricot Project)

Mindrift · Text · RLHF · Evaluation Rating
Executing expert-level Side-by-Side (SxS) comparisons of Large Language Model (LLM) responses. The role involves evaluating model performance across 10 quality dimensions, including Instruction Following, Truthfulness, and Harmlessness. Specialized in "Conversationality" tasks, assessing models on advanced metrics such as:
* Natural Dialogue: Evaluating whether the response mirrors human speech patterns and flow.
* User Intent & Understanding: Measuring the model's emotional intelligence (EQ) and ability to grasp implicit user needs.
* Conversation Continuation: Assessing how effectively the model drives the dialogue forward.
Responsibilities include assigning 1-5 ratings for each dimension and writing detailed, evidence-based rationales to justify preference decisions (a minimal record sketch follows below).
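A minimal sketch of how one such SxS judgment could be recorded. The field names and the abbreviated dimension list are illustrative assumptions, not the project's actual schema:

    from dataclasses import dataclass

    DIMENSIONS = ("instruction_following", "truthfulness", "harmlessness",
                  "natural_dialogue", "user_intent", "conversation_continuation")

    @dataclass
    class SxSJudgment:
        ratings_a: dict   # dimension -> 1..5 rating for response A
        ratings_b: dict   # dimension -> 1..5 rating for response B
        preference: str   # "A", "B", or "tie"
        rationale: str    # evidence-based justification for the preference

        def validate(self):
            # Every rated dimension must be known and scored on the 1-5
            # scale, and no preference stands without a written rationale.
            for ratings in (self.ratings_a, self.ratings_b):
                assert set(ratings) <= set(DIMENSIONS)
                assert all(1 <= r <= 5 for r in ratings.values())
            assert self.preference in ("A", "B", "tie")
            assert self.rationale.strip()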

2025
Toloka

High-Integrity Prompt Engineering & Model Evaluation (MEI/Toloka)

Toloka · Computer Code Programming · Text Generation · RLHF
Executing high-complexity prompt engineering and model evaluation under the "High-Integrity Standard" framework. The role requires adhering to a strict "3-Gate Quality Framework" to generate domain-specific tasks (specifically in Computer Science/Python) that are rigorous, realistic, and solvable. Key responsibilities include:
* Engineered Prompt Creation: Designing prompts that pass strict criteria for "Objective Truth" and "Step Validity", ensuring tasks rely on verifiable reasoning chains rather than subjective requests.
* Multi-Dimensional Evaluation: Assessing AI model responses across five distinct dimensions: Harmlessness, Correctness, Step Validity, Completeness, and Clarity.
* Defensible Scoring: Applying the "Zero-Assumption Rule" to verify every line of code or reasoning step, providing evidence-based justifications for every score (sketched below).
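As a rough illustration of the five-dimension rubric and the "every score needs evidence" discipline, here is a sketch of a submission check. The function name, the dictionary layout, and the 1-5 scale (borrowed from the SxS work above) are my own assumptions, not the framework's real tooling:

    EVAL_DIMENSIONS = ("harmlessness", "correctness", "step_validity",
                       "completeness", "clarity")

    def check_score_sheet(scores, evidence):
        # Reject a submission unless all five dimensions are scored
        # and each score carries an evidence-based justification.
        missing = set(EVAL_DIMENSIONS) - set(scores)
        if missing:
            raise ValueError(f"unscored dimensions: {sorted(missing)}")
        for dim in EVAL_DIMENSIONS:
            if not 1 <= scores[dim] <= 5:
                raise ValueError(f"{dim}: score must be between 1 and 5")
            if not evidence.get(dim, "").strip():
                raise ValueError(f"{dim}: justification required for this score")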

2025
Mindrift

AI Agent Evaluation & Tool Use Benchmarking (TAU Framework)

Mindrift · Text · Evaluation Rating
Evaluating AI agents within the Tool-Agent-User (TAU) framework to benchmark reliability in realistic scenarios. The role focuses on "Trajectory Evaluation": assessing whether agents correctly solve user problems by utilizing specific tools (functions) and strictly adhering to domain policies. Key responsibilities include:
* Trajectory Correctness: Analyzing full user-agent conversations to distinguish "Agent Faults" (policy violations, wrong tool usage) from "User Faults," ensuring the agent's reasoning process is sound even when the final outcome is numerically correct.
* Golden Set Verification: Defining and editing the "Golden Set" (the required sequence of tool calls, read-only vs. DB-modifying, needed to correctly fulfill a request); a toy version appears below.
* Database Logic: Verifying how agents interact with structured JSON databases and ensuring that parameters in tool calls match the database fields.
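A toy sketch of a golden-set trajectory check. The tool names, the rule that only DB-modifying calls are order-sensitive, and the record layout are all hypothetical choices of mine to illustrate the idea:

    # Golden set for one hypothetical request: the agent may read freely,
    # but DB-modifying calls must happen exactly as specified.
    GOLDEN_SET = [
        {"tool": "get_order", "modifies_db": False},
        {"tool": "update_shipping_address", "modifies_db": True},
    ]

    def trajectory_matches(agent_calls, golden=GOLDEN_SET):
        # Compare only the DB-modifying calls: read-only lookups may vary,
        # but writes must match the golden sequence in name and order.
        golden_writes = [c["tool"] for c in golden if c["modifies_db"]]
        agent_writes = [c["tool"] for c in agent_calls if c.get("modifies_db")]
        return agent_writes == golden_writes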

2025

Education

University of Lampung

Bachelor of Computer Science, Computer Science

2016 - 2021

Work History

Mindrift by Toloka

Freelance AI Trainer / AI Coding Evaluator

Jakarta
2025 - Present

Prima Vista Solusi

Software Engineer (Frontend)

South Jakarta
2021 - Present