LLM Evaluation and Prompt Engineer (Recombine.ai)
Labeled, evaluated, and annotated Russian-language LLM outputs for accuracy, completeness, style, and safety. Built, maintained, and analyzed over 1,900 evaluation test suites to drive prompt and dataset improvements. Conducted rubric-based scoring, error tagging, and peer review, and provided feedback to continually improve guideline consistency and dataset quality.
• Constructed challenging edge-case scenarios for stress testing
• Performed prompt engineering for test suites
• Led peer reviews and guideline refinement
• Identified, tagged, and documented error types and failure patterns