AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied World Models

Fanglong Yao, Yuanchang Yue, Youzhi Liu, Xian Sun, Kun Fu·August 28, 2024

Summary

AeroVerse is a benchmark suite designed for simulating, pre-training, finetuning, and evaluating aerospace embodied world models. It addresses the gap in research on UAV intelligent agents by developing AeroSimulator, a simulation platform with realistic urban scenes for UAV flight simulation. The suite includes AerialAgent-Ego10k, a large-scale real-world image-text pre-training dataset, and CyberAgent-Ego500k, a virtual image-text-pose alignment dataset. Five downstream tasks—scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision—are clearly defined, with corresponding instruction datasets for fine-tuning. SkyAgent-Eval, based on GPT-4, serves as the evaluation metric for comprehensive, flexible, and objective assessment. The benchmark suite integrates over 10 2D/3D visual-language models, pre-training datasets, finetuning datasets, and evaluation metrics, aiming to promote exploration and development in aerospace embodied intelligence. The paper introduces a benchmark suite for aerospace embodiment, AeroVerse, which includes a simulation platform, two real-virtual pre-training datasets, and five downstream task instruction datasets. The suite aims to address the complexity of tasks for unmanned aerial vehicle (UAV) agents, such as indoor/outdoor navigation, command following, and embodied question answering. The diversity and interdependence of these tasks lead to unclear definitions for aerial-embodied agents. The paper highlights the difficulty in acquiring 3D data for UAVs, especially outdoors, due to the need for specialized equipment and skilled professionals. The high cost of UAV embodied data collection is also mentioned, considering the extensive training required for annotators. To tackle these issues, the paper defines five downstream tasks for UAV-embodied agents: scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision. It also constructs the first large-scale virtual-reality pre-training dataset and high-quality instruction dataset, including a first-person, high-resolution real-world pre-training dataset and five downstream task instruction datasets created using the established simulation platform, AeroSimulator. The paper introduces AerialAgent-Ego10k, a large-scale real-world image-text pre-training dataset, and CyberAgent-Ego500k, a virtual image-text-posture alignment dataset for the Aerospace Embodied World Model. AerialAgent-Ego10k uses urban UAVs as the primary viewpoint, while CyberAgent-Ego500k contains 500K aligned UAV postures, images, and text descriptions from four 3D urban environments. The dataset was collected over eight months by ten professional annotators, ensuring high-quality data for downstream tasks. The paper also develops SkyAgent-Eval, an automated evaluation approach based on GPT-4, for three types of downstream tasks: scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision-making. This approach employs few-shot instruction and context learning to cater to the customized evaluation needs of various tasks. The paper conducts extensive experiments using ten mainstream baselines on the downstream instruction datasets, revealing the potential and limitations of 2D/3D visual-language models in UAV-agent tasks. The results underscore the necessity of constructing an aerospace embodied world model. The paper also designs a benchmark suite, AeroVerse, which includes over 10 2D/3D visual-language models, 2 pre-training datasets, 5 downstream task instruction datasets, and 10+ evaluation metrics, along with a simulator featuring 4 urban scenarios. In conclusion, the AeroVerse benchmark suite is a comprehensive tool for advancing research in aerospace embodied intelligence, addressing the challenges of UAV agent tasks through simulation, pre-training, fine-tuning, and evaluation. The suite's integration of datasets, models, and evaluation metrics aims to promote exploration and development in this field, facilitating the creation of more capable and versatile UAV agents for real-world applications.

Key findings

Advanced features