Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

Bingchen Zhao, Yongshuo Zong, Letian Zhang, Timothy Hospedales · June 18, 2024

Summary

The paper introduces the Multi-Image Relational Benchmark (MIRB), a comprehensive evaluation suite for vision-language models (VLMs) focused on multi-image understanding. MIRB addresses a gap in existing benchmarks, which largely test single-image comprehension, by assessing a model's ability to compare, analyze, and reason across multiple images. Its tasks span four dimensions: perception, visual world knowledge, reasoning, and multi-hop reasoning, covering diverse scenarios that probe both comprehension and problem-solving. The authors find that open-source models which perform well on single-image tasks significantly underperform on multi-image reasoning, and that even the state-of-the-art GPT-4V struggles on several tasks. MIRB thus highlights the need for models that can handle multi-image inputs effectively and serves as a valuable resource for advancing multi-modal AI research.
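To make the evaluation setup concrete, below is a minimal sketch of a per-category evaluation loop over multi-image questions. The record fields, the category labels, and the `model_fn` interface are assumptions for illustration only and do not reflect MIRB's actual data format or tooling.

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical record layout for one multi-image question;
# MIRB's real schema and field names may differ.
@dataclass
class MultiImageExample:
    image_paths: list[str]   # several images attached to one question
    question: str
    answer: str              # ground-truth answer string
    category: str            # e.g. "perception", "knowledge", "reasoning", "multi-hop"

def evaluate(model_fn, examples: list[MultiImageExample]) -> dict[str, float]:
    """Compute exact-match accuracy per category.

    `model_fn` is a placeholder callable taking (image_paths, question)
    and returning the model's predicted answer as a string.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        pred = model_fn(ex.image_paths, ex.question)
        total[ex.category] += 1
        if pred.strip().lower() == ex.answer.strip().lower():
            correct[ex.category] += 1
    # Report accuracy separately for each of the benchmark's dimensions.
    return {cat: correct[cat] / total[cat] for cat in total}
```

Reporting accuracy per dimension, rather than a single aggregate score, is what lets a benchmark like MIRB separate failures of perception from failures of knowledge or multi-hop reasoning.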

Key findings


Advanced features