[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster
Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang · December 02, 2024
Summary
FasterVLM is a training-free visual token pruning method for large vision-language models (VLMs) that addresses the inference inefficiency caused by large numbers of visual tokens. Rather than relying on text-visual attention inside the language model, FasterVLM scores the importance of each visual token directly by the attention it receives from the [CLS] token in the image encoder, and prunes redundant tokens immediately after the visual encoder, before the language model ever sees them, which yields the maximum possible inference acceleration. Because [CLS] attention accurately identifies the visual tokens that carry global information, performance is preserved even under high reduction ratios: FasterVLM maintains about 90% of LLaVA-1.5-7B's performance after pruning 95% of visual tokens, and it outperforms text-visual attention-based pruning methods across a variety of VLMs.
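As a rough illustration of the core idea (a minimal sketch, not the authors' reference implementation), the following self-contained PyTorch function keeps the patch tokens that receive the most [CLS] attention; the tensor shapes and the 5% keep ratio are illustrative assumptions:

```python
# Minimal sketch of [CLS]-attention-based token pruning (not the official
# FasterVLM implementation). Tensor shapes and keep_ratio are illustrative.
import torch

def prune_by_cls_attention(patch_tokens, cls_attn, keep_ratio=0.05):
    """patch_tokens: (B, N, D) visual tokens from the image encoder.
    cls_attn: (B, N) attention from the [CLS] token to each patch token.
    Returns (B, k, D): the kept tokens, in their original spatial order."""
    k = max(1, int(patch_tokens.shape[1] * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices.sort(dim=1).values
    return patch_tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, patch_tokens.shape[-1]))

# Toy usage with random tensors (576 patch tokens, as in LLaVA-1.5's ViT-L/336).
tokens = torch.randn(1, 576, 1024)
scores = torch.rand(1, 576)
kept = prune_by_cls_attention(tokens, scores)
print(kept.shape)  # torch.Size([1, 28, 1024])
```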
Introduction
Background
Overview of large vision-language models (VLMs)
Challenges with numerous visual tokens in VLMs
Importance of efficient inference in VLMs
Objective
Aim of FasterVLM: addressing inefficiency in VLMs
Goal: maintaining high performance while pruning visual tokens
Method
Data Preparation
Evaluation benchmarks and model inputs used for FasterVLM
Image preprocessing for the visual encoder
Separating the [CLS] token from the image patch tokens (see the sketch below)
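A minimal sketch of this step, assuming Hugging Face transformers and LLaVA-1.5's CLIP ViT-L/336 vision tower (the checkpoint name and input image are placeholders):

```python
# Sketch: run an image through a CLIP vision tower and separate the [CLS]
# token from the patch tokens. Checkpoint and image are placeholders.
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

name = "openai/clip-vit-large-patch14-336"
model = CLIPVisionModel.from_pretrained(name, attn_implementation="eager")
processor = CLIPImageProcessor.from_pretrained(name)

image = Image.open("example.jpg").convert("RGB")  # placeholder input
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

hidden = outputs.last_hidden_state           # (1, 1 + 576, 1024) for ViT-L/336
cls_token, patch_tokens = hidden[:, :1], hidden[:, 1:]
attentions = outputs.attentions              # per-layer (1, heads, 577, 577)
```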
Token Importance Evaluation
Methodology for assessing token importance
Scoring each visual token by the attention it receives from the [CLS] token in the image encoder (see the sketch below)
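Given the `attentions` from the snippet above, token importance can be read off an encoder layer's attention map; taking the last layer and averaging over heads are assumptions of this sketch:

```python
# Importance of each patch token = attention it receives from [CLS]
# (index 0), averaged over heads; using the last layer is an assumption.
attn_last = attentions[-1]                        # (1, heads, 577, 577)
importance = attn_last[:, :, 0, 1:].mean(dim=1)   # (1, 576)
```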
Token Pruning
Process of eliminating redundant visual tokens
Strategy for maintaining high performance after pruning (see the top-k sketch below)
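Continuing the sketch, pruning reduces to a top-k selection over the importance scores; the 5% keep ratio matches the 95% pruning rate quoted in the summary:

```python
# Keep the top 5% of patch tokens ranked by [CLS] attention; sorting the
# surviving indices preserves the patches' original spatial order.
keep_ratio = 0.05
num_keep = max(1, int(patch_tokens.shape[1] * keep_ratio))
keep_idx = importance.topk(num_keep, dim=1).indices.sort(dim=1).values
pruned_tokens = patch_tokens.gather(
    1, keep_idx.unsqueeze(-1).expand(-1, -1, patch_tokens.shape[-1])
)  # (1, 28, 1024): only these tokens are passed on to the projector/LLM
```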
Inference Acceleration
Techniques for achieving faster inference
Pruning immediately after the visual encoder, so the projector and language model process fewer tokens (see the count below)
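Because pruning happens before the multimodal projector, every downstream component (projector, LLM prefill, KV cache) operates on a shorter visual sequence. A back-of-the-envelope count for LLaVA-1.5's 576 visual tokens:

```python
# Illustrative token count for LLaVA-1.5: pruning 95% of 576 visual tokens
# leaves 28, so the projector, LLM prefill, and KV cache all handle
# roughly 20x fewer visual tokens.
total_visual_tokens = 576
kept = max(1, int(total_visual_tokens * 0.05))
print(f"{total_visual_tokens} -> {kept} visual tokens")
```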
Performance Evaluation
Metrics for assessing FasterVLM's effectiveness
Comparison with text-visual attention-based pruning methods (a retention-metric sketch follows)
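One way to report claims like "maintains 90% of LLaVA-1.5-7B's performance" is the mean per-benchmark score ratio against the unpruned baseline; the helper and numbers below are hypothetical:

```python
# Hypothetical metric: mean per-benchmark ratio of the pruned model's score
# to the unpruned baseline's. All numbers below are placeholders.
def performance_retention(pruned_scores, baseline_scores):
    ratios = [p / b for p, b in zip(pruned_scores, baseline_scores)]
    return sum(ratios) / len(ratios)

print(performance_retention([60.1, 54.3], [66.8, 58.2]))  # ~0.92
```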
Results
Performance Metrics
Quantitative results on various VLMs
FasterVLM's impact on inference speed and benchmark performance
Case Studies
Detailed analysis of FasterVLM's application across different models
Outcomes and improvements observed
Conclusion
Summary of FasterVLM's contributions
Future Directions
Potential areas for further research
Expected advancements in VLM optimization