Handshake AI
Scope

Project Aether is a large-scale human data initiative focused on improving reasoning and instruction-following capabilities in frontier large language models. The project sits at the intersection of expert annotation and scalable data infrastructure, supporting advanced post-training pipelines (RLHF, GRPO, and iterative preference optimization). Aether addresses the growing need for high-signal, domain-diverse training data that goes beyond generic web corpora, focusing on complex reasoning traces, multi-step problem solving, and nuanced human preference modeling across technical and professional domains.

Data Labeling Tasks Performed

• Reasoning Trace Annotation: Experts decompose complex problems into step-by-step logical chains, flagging errors, unstated assumptions, and alternative solution paths.
• Preference Pair Ranking: Human raters compare model-generated responses on factual accuracy, coherence, helpfulness, and safety, producing ranked preference pairs for reward model training.
• Instruction Refinement: Annotators rewrite ambiguous or underspecified prompts to improve clarity and reduce variance in model outputs.
• Domain-Specific Validation: Subject-matter experts (software engineers, scientists, legal professionals) verify technical correctness in specialized response sets.
• Synthetic Data Verification: Auditors check AI-generated training examples for hallucinations, logical inconsistencies, and distributional drift before inclusion in fine-tuning datasets.
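Ranked preference pairs of this kind typically feed a pairwise reward-model objective. The following is a minimal sketch assuming a standard Bradley-Terry style loss, as commonly used in RLHF; the `PreferencePair` schema and function names are illustrative, not Aether's actual data format:

```python
import math
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One ranked comparison produced by a human rater (illustrative schema)."""
    prompt: str
    chosen: str    # response the rater preferred
    rejected: str  # response the rater ranked lower

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    The loss shrinks as the reward model scores the preferred
    response higher than the rejected one.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A reward model that already separates the pair incurs a small loss;
# an undecided model (equal scores) incurs log(2).
confident = pairwise_loss(2.0, 1.0)   # ~0.31
undecided = pairwise_loss(1.0, 1.0)   # ~0.69
```

In training, the two reward scores would come from a learned scoring model applied to `prompt + chosen` and `prompt + rejected`; the loss is then averaged over a batch of pairs.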
Project Size

• Annotator Pool: 200+ vetted experts across 15+ domains (computer science, mathematics, medicine, law, finance)
• Data Volume: ~500K labeled examples generated per quarter; ~2M preference pairs ranked to date
• Geographic Coverage: Multilingual annotation spanning English, Spanish, Mandarin, and German to support global model deployment
• Infrastructure: Custom labeling interface built on Handshake AI's internal platform, with integrated quality dashboards and real-time inter-annotator agreement tracking

Quality Measures

• Inter-Annotator Agreement (IAA): Target Cohen's kappa > 0.75 on all ranking tasks; weekly calibration sessions for edge-case alignment
• Expert Credentialing: Domain-specific qualification exams with an 85% pass threshold; continuous performance monitoring via hidden gold-standard tests
• Audit Trail: Full provenance logging for every labeled example: annotator ID, time spent, revision history, and adjudication records
• Bias & Fairness Audits: Quarterly demographic parity checks across gender, ethnicity, and geographic origin in preference distributions
• Data Freshness Protocols: Time-to-live (TTL) policies ensure training data reflects current knowledge cutoffs; outdated factual claims are automatically deprecated
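The IAA target can be checked directly from two raters' labels using the textbook definition of Cohen's kappa. A minimal sketch; the label values and the calibration-flagging wiring around the 0.75 threshold are illustrative, not Aether's actual pipeline:

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b) and rater_a, "paired, non-empty labels"
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if raters labeled independently at their own rates.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((count_a[lbl] / n) * (count_b[lbl] / n)
              for lbl in set(count_a) | set(count_b))
    return (p_o - p_e) / (1.0 - p_e)

# Example: two raters ranking the same six preference comparisons.
rater_a = ["win", "win", "loss", "win", "loss", "win"]
rater_b = ["win", "loss", "loss", "win", "loss", "win"]
kappa = cohens_kappa(rater_a, rater_b)  # 5/6 observed agreement -> kappa = 2/3
if kappa < 0.75:
    print(f"Below IAA target, flag batch for calibration: kappa={kappa:.2f}")
```

In practice a quality dashboard would aggregate kappa per task batch and per annotator pair; the single-pair computation above is the building block.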