On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier

Jiachen Jiang, Jinxin Zhou, Zhihui Zhu · June 20, 2024

Summary

This paper investigates the similarity of internal representations across the hidden layers of transformer models. It finds that cosine similarity effectively captures layer-wise similarity, which decreases as the distance between layers grows, and proposes aligned training to enhance this similarity. Aligned training uses a common classifier trained by minimizing cross-entropy loss, which enables classification from early layers and makes multi-exit models competitive with shallow-layer architectures. The approach benefits both vision (CIFAR-10, ImageNet) and NLP (GLUE, WikiText-103) tasks. The analysis also highlights the role of residual connections and the neural collapse phenomenon, as well as the method's reductions in model complexity and computational cost. Experiments demonstrate improved accuracy, efficiency, and feature alignment across a range of tasks and models.
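As a concrete illustration of the layer-wise analysis, the sketch below extracts hidden states from a pretrained transformer and computes pairwise cosine similarities between layers. It is a minimal sketch under stated assumptions: the model choice (bert-base-uncased), mean-pooling of token features, and averaging over the batch are illustrative simplifications, not necessarily the paper's exact measurement protocol.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Example model; any transformer that exposes its hidden states would do.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tok(["Layer-wise similarity in transformers."], return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.hidden_states: tuple of (num_layers + 1) tensors, each (batch, seq_len, hidden_dim);
# index 0 is the embedding output, so it is skipped here.
feats = [h.mean(dim=1) for h in out.hidden_states[1:]]  # mean-pool tokens per layer

num_layers = len(feats)
sim = torch.zeros(num_layers, num_layers)
for i in range(num_layers):
    for j in range(num_layers):
        sim[i, j] = torch.nn.functional.cosine_similarity(feats[i], feats[j], dim=-1).mean()

# Per the paper's finding, values should be largest near the diagonal and
# decay as the distance between layers grows.
print(sim)
```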


Introduction
  Background
    Evolution of transformer models in NLP and vision tasks
    Importance of internal representations for model performance
  Objective
    Analyze layer-wise similarity in transformers
    Propose aligned training to enhance representation similarity
    Improve the efficiency and competitiveness of multi-exit models
Method
  Data Collection
    Selection of diverse transformer models (e.g., ViT, BERT)
    Datasets: CIFAR-10 and ImageNet (vision); GLUE and WikiText-103 (NLP)
  Data Preprocessing
    Feature extraction from hidden layers
    Calculation of cosine similarity for layer-wise analysis
  Aligned Training Approach (a code sketch follows this outline)
    Common classifier design
    Cross-entropy loss minimization
    Integration with residual connections
  Analysis Techniques
    Neural collapse phenomenon
    Efficiency evaluation: model complexity and computational requirements
Experiments and Results
  Layer-Wise Similarity
    Visualization of cosine similarity patterns across layers
    Observed decline in similarity with layer distance
  Multi-Exit Model Performance
    Improved accuracy with aligned training
    Comparison with shallow-layer architectures
  Task-Specific Outcomes
    Enhanced performance on vision tasks (CIFAR-10, ImageNet)
    Improved performance on NLP tasks (GLUE, WikiText-103)
  Complexity and Efficiency Reduction
    Analysis of model complexity savings
    Computational benefits of aligned training
Conclusion
  Significance of aligned training for representation alignment
  Implications for model design and efficiency
  Future directions for research in transformer model analysis
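To make the Aligned Training Approach items above concrete, here is a minimal PyTorch sketch of training one classifier shared by every layer's exit. The names (ToyEncoder, AlignedMultiExit, aligned_loss) are hypothetical, and details such as feature pooling, per-layer loss weighting, and the exact classifier design may differ from the paper; the sketch only shows the core idea of minimizing cross-entropy at every exit through a single common head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Stand-in for a transformer backbone: returns pooled features from every layer."""
    def __init__(self, dim=64, depth=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)
        )

    def forward(self, x):
        feats = []
        for layer in self.layers:
            x = x + layer(x)          # residual connection, as in transformer blocks
            feats.append(x)
        return feats                   # one feature tensor per layer (each is an exit point)

class AlignedMultiExit(nn.Module):
    """One classifier shared by every exit layer (the 'common classifier' idea)."""
    def __init__(self, encoder, dim=64, num_classes=10):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(dim, num_classes)   # single shared head

    def forward(self, x):
        return [self.classifier(f) for f in self.encoder(x)]

def aligned_loss(logits_per_layer, labels):
    # Sum cross-entropy over all exits so every layer's features align with the shared head.
    return sum(F.cross_entropy(logits, labels) for logits in logits_per_layer)

# Toy usage: one training step, then an early exit at the third layer.
model = AlignedMultiExit(ToyEncoder())
x, y = torch.randn(8, 64), torch.randint(0, 10, (8,))
loss = aligned_loss(model(x), y)
loss.backward()
early_logits = model(x)[2]            # apply the shared head to an intermediate layer
```

Because all exits share one head, inference can stop at any intermediate layer by applying that head to the current layer's features, which is what makes the multi-exit model cheap at test time.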
Basic info
Categories: Computation and Language, Computer Vision and Pattern Recognition, Machine Learning, Artificial Intelligence
Insights
How does cosine similarity relate to layer-wise similarity in the investigated models?
How does aligned training impact the efficiency and complexity of multi-exit models compared to shallow-layer architectures?
What aspects of transformer model representations does the paper focus on?
What is the proposed 'aligned training' method, and how does it enhance model performance?