Response Rater
One project I worked on frequently involved rating an AI assistant's response to a given user prompt in a mobile phone assistant context. Part of the work was inferring what the user wanted from the prompt, and then judging how the assistant interpreted it in the context of the system. "Pause," for example, would likely mean pausing the music that is currently playing, so the decisions are highly contextual.