Human Lever
Worked on the Humanity's Last Exam dataset project, generating a high-quality dataset designed to evaluate the reasoning, comprehension, and problem-solving abilities of LLMs. The project involved creating diverse, challenging prompts across multiple domains, including logic, mathematics, ethics, hypothetical scenarios, and high-stakes decision-making, intended to test LLM performance under complex, human-level cognitive conditions. Data generation included constructing question–answer pairs, multi-step reasoning tasks, edge-case scenarios, and adversarial examples to benchmark model robustness. Strict quality-control procedures were applied, including manual validation, consistency checks, and automated filtering, to ensure difficulty, clarity, and originality. The resulting dataset provides a comprehensive and demanding evaluation framework aimed at pushing the boundaries of LLM understanding and reasoning capabilities.
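To give a flavor of what the automated filtering stage might look like, here is a minimal Python sketch; it is an illustration only, not the project's actual tooling, and every name in it (QAPair, MIN_QUESTION_WORDS, passes_filters, the thresholds, and the hash-based duplicate check) is a hypothetical stand-in for the real quality-control pipeline.

```python
"""Illustrative sketch of automated QA-pair filtering.

All names and thresholds here are hypothetical stand-ins for the
project's actual quality-control tooling.
"""
from dataclasses import dataclass
import hashlib
import re


@dataclass
class QAPair:
    domain: str      # e.g. "logic", "mathematics", "ethics"
    question: str
    answer: str


MIN_QUESTION_WORDS = 15      # hypothetical clarity/difficulty floor
_seen_hashes: set[str] = set()


def _normalize(text: str) -> str:
    """Lowercase and collapse non-word characters for duplicate checks."""
    return re.sub(r"\W+", " ", text.lower()).strip()


def passes_filters(pair: QAPair) -> bool:
    """Apply simple automated checks: length, completeness, originality."""
    if len(pair.question.split()) < MIN_QUESTION_WORDS:
        return False   # too short to pose a genuinely challenging task
    if not pair.answer.strip():
        return False   # consistency check: every question needs an answer
    digest = hashlib.sha256(_normalize(pair.question).encode()).hexdigest()
    if digest in _seen_hashes:
        return False   # originality: reject exact (normalized) duplicates
    _seen_hashes.add(digest)
    return True


if __name__ == "__main__":
    candidates = [
        QAPair("logic",
               "If all A are B and some B are C, can we conclude that "
               "some A are C? Justify your answer step by step.",
               "No; the premises do not guarantee overlap between A and C."),
        QAPair("logic", "Short?", "Too brief to keep."),
    ]
    kept = [p for p in candidates if passes_filters(p)]
    print(f"kept {len(kept)} of {len(candidates)} candidate pairs")
```

In a real pipeline, pairs passing these automated checks would still go through the manual validation described above; the hash-based duplicate check shown here catches only exact restatements, so a production filter would likely add fuzzier near-duplicate detection.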