Model Evaluation and Labeling Contributor
Contributed to building and evaluating datasets for model assessment and selection at a large language model (LLM) company.
• Created and labeled evaluation datasets for natural language processing tasks including Q&A, reasoning, summarization, and information extraction.
• Assessed mainstream language models on multiple criteria, including accuracy, hallucination rate, and robustness.
• Rated model outputs for accuracy and reliability.
• Contributed to the design of model evaluation and selection processes.
• Supported real-world business deployment by evaluating model performance.