Evaluation in Various Subcategories
- Conversational Mode (or Voice) Evaluation: When a user makes a verbal request, the model responds by voice. Evaluating and comparing two responses across various dimensions takes 15-20 minutes on average. The projects were relatively small, with fewer than 1,000 tasks.
- Cultural Relevance Evaluation: The prompt and the responses have to be suitable for domestic users, in this case Korean users. The prompt has to be written in Korean, and the responses have to be written in Korean without language mistakes or errors. The content has to be engaging, useful, and helpful for Korean users. The projects were medium-sized, around 12,00 tasks. A task takes 15-20 minutes on average.
- Regular Evaluation: One project, in which my work was rated highly enough for me to be selected as a reviewer, emphasized language dimensions such as grammar, localization, fluency, and structure. A task takes about 30 minutes on average.