Mincho Kraevski

Software Architect | Expert AI Evaluation (RLHF) | High-Scale iGaming Systems

Plovdiv, Bulgaria
$45.00/hr | Expert | Internal Proprietary Tooling

Key Skills

Software

Internal/Proprietary Tooling

Top Subject Matter

High-Scale iGaming & Distributed Systems Architecture
Performance Optimization & Scalability
AI-Native Engineering & Agentic Workflow Orchestration

Top Data Types

Text
Document
Computer Code Programming

Top Task Types

Evaluation/Rating
Computer Programming/Coding
Text Generation
Question Answering
RLHF

Freelancer Overview

As a Software Engineering Architect with 15+ years of experience in high-scale distributed systems (100k+ req/min), I specialize in Expert RLHF, Red Teaming, and Evaluation of LLM-generated code. I provide ground truth validation, ranking, and correction of model outputs, focusing on performance-critical backend systems where latency, scalability, and data integrity are non-negotiable. My evaluations go beyond surface correctness: I identify hidden failure modes such as inefficient database queries, race conditions, poor concurrency handling, and security risks that weaker reviewers and models often miss.

I operate in a fully AI-native workflow, benchmarking outputs across multiple models (Qwen, Kimi, GLM, Gemma vs. Claude, GPT, Gemini) to determine the most aligned and production-ready solutions. I apply structured evaluation frameworks to consistently produce gold-standard outputs used for RLHF and Supervised Fine-Tuning (SFT). In addition, I design agentic evaluation pipelines and prompt frameworks that improve model reliability, reduce hallucinations, and optimize token efficiency.

My expertise spans backend systems, infrastructure, and full-stack architectures, with a strong focus on real-world production constraints and cost-aware engineering.

Expert | English | Bulgarian

Labeling Experience

Supervised Fine-Tuning (SFT) Evaluation of Multi-Model LLM Code Generation Pipelines

Computer Code Programming | Prompt Response Writing | SFT
Designed and evaluated a multi-agent LLM development system to improve model performance across full-stack engineering tasks. Curated high-quality supervised examples (gold standard outputs) and used them to guide SFT-style evaluation workflows. Generated and compared outputs from open-weight models (Qwen, Kimi, GLM, Gemma) and proprietary models (Claude, GPT, Gemini), evaluating them across correctness, architectural integrity, type safety, and cross-service integration. Performed ranking, correction, and iterative refinement, identifying common failure patterns such as broken async flows, missing edge-case handling, and invalid API contracts. Developed reusable prompting frameworks and evaluation rubrics that significantly improved output consistency and reduced token usage while increasing quality.

2025 - Present
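The ranking-and-rubric workflow described in this project can be sketched as a simple weighted scorer. This is a minimal illustration only: the dimension names, weights, and function names below are assumptions, not the actual rubric used.

```python
# Minimal sketch of a rubric-weighted ranker for candidate model outputs.
# Dimensions mirror the rubric axes named above (correctness, architectural
# integrity, type safety, cross-service integration); weights are invented.

from typing import Dict, List, Tuple

WEIGHTS: Dict[str, float] = {
    "correctness": 0.40,
    "architectural_integrity": 0.25,
    "type_safety": 0.20,
    "integration": 0.15,
}

def score(ratings: Dict[str, float]) -> float:
    """Weighted sum of per-dimension ratings, each rating in [0, 1]."""
    return sum(WEIGHTS[d] * ratings.get(d, 0.0) for d in WEIGHTS)

def rank(candidates: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Return (model_name, score) pairs, best-scoring candidate first."""
    return sorted(
        ((name, score(ratings)) for name, ratings in candidates.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
```

In practice the per-dimension ratings would come from human expert review of each model's output; the scorer only makes the final ordering reproducible.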

LLM Evaluation & Rating of Production-Grade Backend and API Code

Computer Code Programming | Evaluation/Rating
Performed structured evaluation and rating of LLM-generated backend code for a large-scale TypeScript/Node.js platform with 100+ database models and complex domain logic (payments, KYC, real-time events). Established evaluation rubrics covering correctness, scalability, security, and maintainability, and applied them to outputs generated by multiple models. Compared responses against ground truth implementations derived from production systems. Ranked outputs based on their ability to handle concurrency, edge cases, and integration with external systems (Kafka, Redis, WebSockets, payment providers). Provided detailed feedback to improve model alignment and code quality, particularly in high-risk areas such as financial transactions and real-time systems.

2026 - Present

RLHF-Based Optimization of LLM Outputs for Frontend Performance & SSR Reliability

Computer Code Programming | RLHF
Evaluated and optimized LLM-generated frontend code (SvelteKit, SSR applications) using RLHF methodologies. Focused on identifying model failures related to hydration mismatches, state management, and client-server synchronization. Generated multiple candidate implementations using different models and performed ranking and correction based on real-world production behavior. Identified subtle issues such as SSR/client divergence, broken modal interactions, and incorrect lifecycle handling. Refined outputs into gold standard solutions, including client-only guards, hydration-safe patterns, and correct async state synchronization. This improved model alignment for modern frontend frameworks and reduced production bugs caused by incorrect SSR assumptions.

2026 - 2026

Adversarial Red Teaming of LLM-Generated SQL & Backend Query Logic

Computer Code Programming | Red Teaming
Conducted adversarial testing of LLM-generated SQL and ORM (Prisma) queries in a high-load PostgreSQL environment with multi-million-row tables. Designed prompts to intentionally expose weaknesses in model reasoning around query planning, indexing, and cost optimization. Benchmarked outputs from multiple models against ground truth optimized implementations, identifying systemic issues such as full table scans, misuse of COUNT(*), missing indexes, and inefficient aggregations. Performed red-team evaluation and corrective feedback loops, transforming naive outputs into gold standard solutions, including reltuples-based estimation and EXPLAIN-driven query strategies—achieving 950x–5000x performance improvements. This project highlighted critical gaps in model understanding of database performance under production constraints.

2026 - 2026
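The COUNT(*) misuse and reltuples-based correction mentioned in this project can be illustrated with a short sketch. The helper names are hypothetical; the estimate query relies on PostgreSQL's pg_class catalog, whose reltuples value is approximate and refreshed by ANALYZE/autovacuum.

```python
# Sketch: contrast an exact COUNT(*) (a full scan on large tables) with a
# planner-statistics estimate read from PostgreSQL's pg_class catalog.

def exact_count_sql(table: str) -> str:
    # Exact but O(n): scans every visible row of the table.
    return f'SELECT COUNT(*) FROM "{table}"'

def estimated_count_sql(table: str) -> str:
    # Approximate but O(1): reads the row estimate maintained by ANALYZE.
    # Real code should bind the table name safely rather than interpolate it.
    return (
        "SELECT reltuples::bigint AS estimate "
        f"FROM pg_class WHERE relname = '{table}'"
    )
```

On multi-million-row tables the estimate is typically accurate enough for pagination totals and dashboards, which is where the exact count is most often misused.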

LLM Code Evaluation & Alignment for Distributed Observability Systems

Computer Code Programming | RLHF
Acted as an Expert SME performing RLHF-based evaluation of LLM-generated infrastructure and observability code for a high-scale distributed system deployed on AWS (EKS, RDS, Kafka). Leveraged a multi-model benchmarking pipeline (Qwen, Kimi, GLM, Gemma vs. Claude, GPT, Gemini) to generate candidate implementations for monitoring, alerting, and metrics instrumentation. Established production-grade ground truth implementations for Grafana dashboards, Prometheus metrics, alert rules, and PagerDuty integration, then performed ranking, correction, and alignment of model outputs. Evaluated responses on correctness, operational reliability, alert quality, and adherence to SRE best practices. Identified failure modes such as hallucinated metrics, invalid PromQL queries, and misconfigured alert thresholds. Refined outputs into gold standard configurations, resulting in 15 production dashboards and 93 alert rules. This work improved model alignment for real-world DevOps scenarios, ensuring outputs were production-safe under high-load conditions.

2026 - 2026

Education

University of Plovdiv “Paisii Hilendarski”

Bachelor of Science, Informatics

2010 - 2014

Work History

Confidential

Engineering Manager and Full-Stack Engineer

Plovdiv
2025 - Present
DraftKings

Software Engineering Manager

Plovdiv
2019 - 2025