Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models
Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han · May 22, 2025
Summary
Circle-RoPE is a novel positional encoding scheme for large vision-language models (LVLMs) that addresses cross-modal positional bias. The authors first propose Per-Token Distance (PTD), a metric that quantifies how independent the positional encodings of different modalities are. Circle-RoPE then maps image token indices onto a circular trajectory so that, together with the linear sequence of text token indices, they form a cone-like structure in which each text token is equidistant from all image tokens; this removes artificial cross-modal relative-position bias while preserving intra-image spatial information. A staggered layer strategy, which applies different RoPE variants across layers, further improves model performance. Experimental results show that the method preserves spatial information from images while reducing relative positional bias, offering a robust and flexible positional encoding framework for LVLMs.
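The equidistance property described above can be illustrated geometrically: if image token indices are placed on a circle centered on the text-token index axis, every text token on that axis is the same distance from all image tokens. The following is a minimal sketch of that geometry, not the paper's actual implementation; the function names (`circle_positions`, `text_position`) and the 3D Euclidean setup are illustrative assumptions.

```python
import numpy as np

def circle_positions(num_image_tokens, radius=1.0, center_index=0.0):
    # Hypothetical sketch: map image token indices onto a circle of
    # the given radius, lying in a plane orthogonal to the text-index
    # axis and centered on it (the "cone-like" arrangement).
    angles = 2 * np.pi * np.arange(num_image_tokens) / num_image_tokens
    xy = np.stack([radius * np.cos(angles), radius * np.sin(angles)], axis=1)
    z = np.full((num_image_tokens, 1), center_index)
    return np.concatenate([xy, z], axis=1)  # shape (N, 3)

def text_position(index):
    # Text tokens keep their ordinary 1D index, placed on the axis itself.
    return np.array([0.0, 0.0, float(index)])

# Every text token is equidistant from all image tokens on the circle,
# so no image token is artificially "closer" to any text token.
img = circle_positions(8, radius=2.0, center_index=4.0)
txt = text_position(10)
dists = np.linalg.norm(img - txt, axis=1)
print(dists)  # all entries equal
```

Because the circle is centered on the text axis, the text-to-image distance depends only on the circle's radius and the text token's offset along the axis, never on which image token is involved; intra-image structure, by contrast, survives in the angular positions around the circle.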