WebCanvas: Benchmarking Web Agents in Online Environments
Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu · June 18, 2024
Summary
WebCanvas is a dynamic evaluation framework for web agents that measures how well agents adapt to evolving, live web environments. It introduces Mind2Web-Live, a benchmark dataset of 542 tasks and 2,439 intermediate states, along with lightweight annotation tools. The framework focuses on realistic online assessment: the best-performing agent achieves a task success rate of 23.1% and a completion rate of 48.8%. Key nodes, the intermediate states an agent must reach during a task, are central to measuring progress and adaptation, and the platform encourages community contributions. The study highlights the need for real-time data gathering and online benchmarking, revealing discrepancies between offline and online model performance and underscoring the importance of key nodes for accurate evaluation. WebCanvas aims to drive advances in autonomous web agents by enabling continuous, scalable data collection in dynamic web scenarios.
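The two reported metrics can be understood through key-node matching. A minimal sketch, assuming task success means every key node was reached and completion rate is the average fraction of key nodes reached per task (function names and data layout are illustrative, not the framework's actual API):

```python
# Hypothetical key-node scoring in the style described above.
# Each task is summarized as (matched_key_nodes, total_key_nodes).

def completion_rate(matched: int, total: int) -> float:
    """Fraction of a task's key nodes (intermediate states) the agent reached."""
    return matched / total if total else 0.0

def evaluate(tasks: list[tuple[int, int]]) -> tuple[float, float]:
    """Return (task success rate, average completion rate) over all tasks.
    A task counts as a success only when all of its key nodes are matched."""
    successes = sum(1 for m, t in tasks if t and m == t)
    avg_completion = sum(completion_rate(m, t) for m, t in tasks) / len(tasks)
    return successes / len(tasks), avg_completion

# Example: one task fully completed, one half done, one untouched.
success, completion = evaluate([(4, 4), (2, 4), (0, 4)])
```

Under this scheme a partially completed task contributes to the completion rate but not to the success rate, which is why the paper can report a 48.8% completion rate alongside a much lower 23.1% success rate.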