MCPWorld: A Unified Benchmarking Testbed for API, GUI, and Hybrid Computer Use Agents
Yunhe Yan, Shihe Wang, Jiajun Du, Yexuan Yang, Yuxuan Shan, Qichen Qiu, Xianqing Jia, Xinge Wang, Xin Yuan, Xu Han, Mao Qin, Yinxiao Chen, Chen Peng, Shangguang Wang, Mengwei Xu · June 9, 2025
Summary
MCPWorld is a unified testbed for benchmarking computer use agents powered by large language models, covering API-based, GUI-based, and hybrid agents. It comprises 201 tasks, supports white-box applications, and is containerized with GPU support for flexible deployment. Initial experiments report 75.12% task completion accuracy, which suggests practical potential for automation. The study also lists tips for efficient computer use (zooming, chaining function calls, and avoiding Firefox wizards), notes that BashTool improves agent performance, and details 36 Blender tasks ranging from basic to complex operations.
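As a rough sanity check on the headline figure, the sketch below relates the reported accuracy to a whole number of tasks. It assumes that the 75.12% was measured over all 201 tasks in a single run, which the summary does not state. The function name `completion_accuracy` and the constants are illustrative only and are not part of MCPWorld.

```python
def completion_accuracy(completed: int, total: int) -> float:
    """Return task completion accuracy as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= completed <= total:
        raise ValueError("completed must be between 0 and total")
    return 100.0 * completed / total


TOTAL_TASKS = 201          # task count given in the summary
REPORTED_ACCURACY = 75.12  # percentage given in the summary

# Assumption: accuracy is computed over all 201 tasks.
# Collect every whole-task count that rounds to the reported figure.
matches = [
    k
    for k in range(TOTAL_TASKS + 1)
    if abs(completion_accuracy(k, TOTAL_TASKS) - REPORTED_ACCURACY) < 0.005
]
print(matches)
```

Under that assumption, the only count consistent with the reported figure is 151 completed tasks out of 201. If the experiments used a subset of tasks or averaged several runs, this reading would not apply.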