Evaluating the robustness of code-based plans generated by LLMs
Evaluated plans proposed by large language models for household tasks (e.g., making a sandwich, cleaning up a room) associated with images from a simulator. Assessed each plan, written as Python code, against the visual scene on three criteria: whether the plan would sufficiently complete the task, how closely it matched my expectation for the task, and how efficient the generated code was.
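
The three criteria above could be recorded per plan in a small structure like the one below. This is only an illustrative sketch: the `PlanRating` class, its field names, the `summarize` helper, and the 1-5 scales are all assumptions made for the example, not the rubric or tooling actually used in the evaluation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PlanRating:
    """One judgment of a single LLM-generated plan (hypothetical schema)."""
    task: str               # e.g. "make a sandwich"
    completes_task: bool    # would the plan sufficiently complete the task in the scene?
    expectation_match: int  # assumed 1-5 scale: how closely the plan matches the rater's expectation
    efficiency: int         # assumed 1-5 scale: how efficient the generated code is

    def __post_init__(self):
        # Reject scores outside the assumed 1-5 range.
        for name in ("expectation_match", "efficiency"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")


def summarize(ratings):
    """Aggregate ratings into a completion rate and mean scores."""
    if not ratings:
        raise ValueError("no ratings to summarize")
    return {
        "completion_rate": mean(1.0 if r.completes_task else 0.0 for r in ratings),
        "mean_expectation_match": mean(r.expectation_match for r in ratings),
        "mean_efficiency": mean(r.efficiency for r in ratings),
    }


# Made-up example values, for illustration only.
ratings = [
    PlanRating("make a sandwich", completes_task=True, expectation_match=4, efficiency=3),
    PlanRating("clean up the room", completes_task=False, expectation_match=2, efficiency=5),
]
summary = summarize(ratings)
```

Keeping task completion as a yes/no field, separate from the two graded scores, mirrors the wording of the criteria: a plan can finish the task while still being inefficient or diverging from what the rater expected.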