[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster
Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang · December 02, 2024
Summary
FasterVLM is a training-free visual token pruning method for large vision-language models (VLMs) that addresses the inference inefficiency caused by large numbers of visual tokens. Rather than relying on text-visual attention inside the language model, FasterVLM scores the importance of each visual token directly by the attention it receives from the [CLS] token in the image encoder, and prunes redundant tokens immediately after the visual encoder, before the language model ever sees them, which yields the maximum possible inference acceleration. Because [CLS] attention accurately identifies the visual tokens that carry global information, performance is preserved even under high reduction ratios: FasterVLM maintains about 90% of LLaVA-1.5-7B's performance after pruning 95% of visual tokens, and it outperforms text-visual attention-based pruning methods across a variety of VLMs.
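As a rough illustration of the core idea (a minimal sketch, not the authors' reference implementation), the following self-contained PyTorch function keeps the patch tokens that receive the most [CLS] attention; the tensor shapes and the 5% keep ratio are illustrative assumptions:

```python
# Minimal sketch of [CLS]-attention-based token pruning (not the official
# FasterVLM implementation). Tensor shapes and keep_ratio are illustrative.
import torch

def prune_by_cls_attention(patch_tokens, cls_attn, keep_ratio=0.05):
    """patch_tokens: (B, N, D) visual tokens from the image encoder.
    cls_attn: (B, N) attention from the [CLS] token to each patch token.
    Returns (B, k, D): the kept tokens, in their original spatial order."""
    k = max(1, int(patch_tokens.shape[1] * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices.sort(dim=1).values
    return patch_tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, patch_tokens.shape[-1]))

# Toy usage with random tensors (576 patch tokens, as in LLaVA-1.5's ViT-L/336).
tokens = torch.randn(1, 576, 1024)
scores = torch.rand(1, 576)
kept = prune_by_cls_attention(tokens, scores)
print(kept.shape)  # torch.Size([1, 28, 1024])
```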
Introduction
Background
Overview of large vision-language models (VLMs)
Challenges with numerous visual tokens in VLMs
Importance of efficient inference in VLMs
Objective
Aim of FasterVLM: addressing inefficiency in VLMs
Goal: maintaining high performance while pruning visual tokens
Method
Data Preparation
Evaluation benchmarks and model inputs used for FasterVLM
Image preprocessing for the visual encoder
Separating the [CLS] token from the image patch tokens (see the sketch below)
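A minimal sketch of this step, assuming Hugging Face transformers and LLaVA-1.5's CLIP ViT-L/336 vision tower (the checkpoint name and input image are placeholders):

```python
# Sketch: run an image through a CLIP vision tower and separate the [CLS]
# token from the patch tokens. Checkpoint and image are placeholders.
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

name = "openai/clip-vit-large-patch14-336"
model = CLIPVisionModel.from_pretrained(name, attn_implementation="eager")
processor = CLIPImageProcessor.from_pretrained(name)

image = Image.open("example.jpg").convert("RGB")  # placeholder input
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

hidden = outputs.last_hidden_state           # (1, 1 + 576, 1024) for ViT-L/336
cls_token, patch_tokens = hidden[:, :1], hidden[:, 1:]
attentions = outputs.attentions              # per-layer (1, heads, 577, 577)
```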
Token Importance Evaluation
Methodology for assessing token importance
Scoring each visual token by the attention it receives from the [CLS] token in the image encoder (see the sketch below)
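Given the `attentions` from the snippet above, token importance can be read off an encoder layer's attention map; taking the last layer and averaging over heads are assumptions of this sketch:

```python
# Importance of each patch token = attention it receives from [CLS]
# (index 0), averaged over heads; using the last layer is an assumption.
attn_last = attentions[-1]                        # (1, heads, 577, 577)
importance = attn_last[:, :, 0, 1:].mean(dim=1)   # (1, 576)
```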
Token Pruning
Process of eliminating redundant visual tokens
Strategy for maintaining high performance after pruning (see the top-k sketch below)
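Continuing the sketch, pruning reduces to a top-k selection over the importance scores; the 5% keep ratio matches the 95% pruning rate quoted in the summary:

```python
# Keep the top 5% of patch tokens ranked by [CLS] attention; sorting the
# surviving indices preserves the patches' original spatial order.
keep_ratio = 0.05
num_keep = max(1, int(patch_tokens.shape[1] * keep_ratio))
keep_idx = importance.topk(num_keep, dim=1).indices.sort(dim=1).values
pruned_tokens = patch_tokens.gather(
    1, keep_idx.unsqueeze(-1).expand(-1, -1, patch_tokens.shape[-1])
)  # (1, 28, 1024): only these tokens are passed on to the projector/LLM
```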
Inference Acceleration
Techniques for achieving faster inference
Pruning immediately after the visual encoder, so the projector and language model process fewer tokens (see the count below)
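Because pruning happens before the multimodal projector, every downstream component (projector, LLM prefill, KV cache) operates on a shorter visual sequence. A back-of-the-envelope count for LLaVA-1.5's 576 visual tokens:

```python
# Illustrative token count for LLaVA-1.5: pruning 95% of 576 visual tokens
# leaves 28, so the projector, LLM prefill, and KV cache all handle
# roughly 20x fewer visual tokens.
total_visual_tokens = 576
kept = max(1, int(total_visual_tokens * 0.05))
print(f"{total_visual_tokens} -> {kept} visual tokens")
```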
Performance Evaluation
Metrics for assessing FasterVLM's effectiveness
Comparison with text-visual attention-based pruning methods (a retention-metric sketch follows)
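One way to report claims like "maintains 90% of LLaVA-1.5-7B's performance" is the mean per-benchmark score ratio against the unpruned baseline; the helper and numbers below are hypothetical:

```python
# Hypothetical metric: mean per-benchmark ratio of the pruned model's score
# to the unpruned baseline's. All numbers below are placeholders.
def performance_retention(pruned_scores, baseline_scores):
    ratios = [p / b for p, b in zip(pruned_scores, baseline_scores)]
    return sum(ratios) / len(ratios)

print(performance_retention([60.1, 54.3], [66.8, 58.2]))  # ~0.92
```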
Results
Performance Metrics
Quantitative results on various VLMs
FasterVLM's impact on inference speed and benchmark performance
Case Studies
Detailed analysis of FasterVLM's application across different models
Outcomes and improvements observed
Conclusion
Summary of FasterVLM's contributions
Future Directions
Potential areas for further research
Expected advancements in VLM optimization