NeCo: Improving DINOv2's spatial representations in 19 GPU hours with Patch Neighbor Consistency
Valentinos Pariza, Mohammadreza Salehi, Gertjan Burghouts, Francesco Locatello, Yuki M. Asano · August 20, 2024
Summary
NeCo is a self-supervised learning method that improves the spatial representations of DINOv2 by enforcing patch-level nearest-neighbor consistency. This is achieved through a training loss that enforces consistent patch neighborhoods between a student and a teacher model relative to reference batches. By applying a differentiable sorting method on top of pretrained representations, NeCo uses dense post-pretraining to enhance performance across a range of models and datasets. Despite requiring only 19 hours on a single GPU, NeCo produces high-quality dense feature encoders and sets new state-of-the-art results in non-parametric in-context semantic segmentation on ADE20k and Pascal VOC, and in linear segmentation evaluations on COCO-Things and COCO-Stuff. Inspired by the in-context abilities of large language models, where predictions are made non-parametrically by retrieving relevant examples, NeCo uses nearest-neighbor retrieval as a training signal to cluster and distinguish fine-grained visual similarities, aiming to produce models with deeply semantic spatial features suited to in-context tasks.
NeCo is a post-pretraining adaptation that applies a dense, patch-sorting-based self-supervised objective to any pretrained Vision Transformer. It addresses the discrete, non-differentiable nature of nearest-neighbor retrieval by using a differentiable sorting method to backpropagate gradients, allowing the objective to be optimized end-to-end. NeCo is demonstrated on six different backbones and evaluated on four datasets under four evaluation protocols, achieving performance gains from 6% to 16%. The method sets several new state-of-the-art results, particularly on the in-context segmentation benchmark, where it outperforms previous methods such as CrIBo and DINOv2 on Pascal VOC and ADE20k by 4% to 13% across different metrics.
NeCo aims to learn a feature space in which patches belonging to the same object have similar features, while patches from different objects have distinct features, a property that is crucial for self-supervised dense representation learning. The method divides input images into patches and extracts dense features with a Vision Transformer backbone. A teacher-student framework is used, with the teacher's weights updated as an exponential moving average of the student's weights. To spatially align features across the two augmented views, RoI-Align is applied according to the crop-augmentation parameters. Pairwise distances between patch features are computed with cosine similarity, and differentiable sorting is used to enforce a consistent neighbor ordering across views, yielding more robust and semantically meaningful patch representations.
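As a rough illustration of this pipeline, the sketch below shows how patch features could be extracted from two views and how the teacher could be updated as an exponential moving average of the student. The helper names (`dense_features`, `ema_update`) and the assumption that the backbone returns patch tokens directly are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Update teacher weights as an exponential moving average of the student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.data.mul_(momentum).add_(p_s.data, alpha=1.0 - momentum)

def dense_features(model, images):
    """Extract L2-normalised patch-token features.

    Assumes `model(images)` returns a (B, N_patches, D) tensor of patch tokens;
    real backbones such as DINOv2 expose patch tokens via their own helpers.
    """
    feats = model(images)              # (B, N, D) patch embeddings
    return F.normalize(feats, dim=-1)  # unit norm, so dot products are cosine sims

# Typical use with two augmented views of the same images:
# student_feats = dense_features(student, view_1)       # gradients flow
# with torch.no_grad():
#     teacher_feats = dense_features(teacher, view_2)   # no gradients
# ema_update(teacher, student)                          # after each optimizer step
```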
NeCo computes distances between the patch features of the two views using cosine similarity (Equations 1 and 2 in the paper). The resulting distance matrices are then sorted in a differentiable manner so that the neighbor ordering stays consistent across both views. Because traditional sorting algorithms are non-differentiable, a differentiable sorting method is used that compares pairs of elements and softly swaps their positions based on their values, allowing gradients to propagate through the sorting operation and the objective to be optimized during training.
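One common way to relax sorting is a sorting network with sigmoid-relaxed pairwise swaps, as in differentiable sorting networks. The sketch below applies that idea to a vector of patch-to-reference cosine similarities; it is a minimal approximation of the mechanism described above, not the exact operator used in the paper, and the commented loss at the end is only an assumed illustration of the consistency idea.

```python
import torch

def soft_swap(a, b, temperature=0.1):
    """Softly order a pair: alpha -> 1 keeps (a, b) as is, alpha -> 0 swaps them."""
    alpha = torch.sigmoid((b - a) / temperature)  # ~1 when already ordered (a < b)
    low = alpha * a + (1 - alpha) * b
    high = alpha * b + (1 - alpha) * a
    return low, high

def soft_sort(x, temperature=0.1):
    """Odd-even transposition sorting network with relaxed comparisons.

    x: (..., n) scores, e.g. cosine similarities of one patch to a batch of
    reference patches. Returns an approximately sorted, differentiable result.
    """
    cols = list(x.unbind(dim=-1))
    n = len(cols)
    for step in range(n):
        for i in range(step % 2, n - 1, 2):
            cols[i], cols[i + 1] = soft_swap(cols[i], cols[i + 1], temperature)
    return torch.stack(cols, dim=-1)

# Sketch of the consistency idea: sort student and teacher similarities to the
# same reference patches and penalise disagreement between the sorted values.
# sims_student = student_feats @ reference_feats.T   # (N, N_ref) cosine sims
# sims_teacher = teacher_feats @ reference_feats.T
# loss = (soft_sort(sims_student) - soft_sort(sims_teacher).detach()).pow(2).mean()
```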
NeCo significantly improves performance on various datasets, boosting models initialized with different pretraining techniques. Performance is measured by producing cluster maps for each image, matching them to the ground truth with Hungarian matching, and reporting mIoU scores. Under this protocol, NeCo surpasses state-of-the-art models such as CrIBo and DINO by 14.5% on average across datasets and metrics. In linear semantic segmentation evaluations, it outperforms CrIBo by at least 10% and DINOv2 by 5% to 7%. Its effectiveness is consistent across different self-supervised initializations, with improvements of 4% to 30% across metrics and datasets.
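For context, the unsupervised segmentation protocol mentioned above is commonly implemented as below: build a confusion matrix between predicted cluster IDs and ground-truth classes, match them with the Hungarian algorithm, and compute mIoU over the matched pairs. The function name and array layout are assumptions for illustration, not code from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_miou(pred_clusters, gt_labels, n_classes):
    """Match cluster IDs to ground-truth classes and report mean IoU.

    pred_clusters, gt_labels: flat integer arrays over all evaluated pixels,
    with cluster IDs / class IDs in [0, n_classes).
    """
    # Confusion matrix: rows = predicted clusters, cols = ground-truth classes.
    conf = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(conf, (pred_clusters, gt_labels), 1)

    # Hungarian matching maximises the total overlap of the one-to-one assignment.
    row_ind, col_ind = linear_sum_assignment(conf, maximize=True)

    ious = []
    for r, c in zip(row_ind, col_ind):
        inter = conf[r, c]
        union = conf[r, :].sum() + conf[:, c].sum() - inter
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```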
NeCo's experimental setup uses a dense post-pretraining framework implemented in Python with PyTorch and PyTorch Lightning. The pretraining datasets are COCO and ImageNet-100, with data augmentations including random color jitter, Gaussian blur, grayscale, and multi-crop. The backbone is a Vision Transformer (ViT-Small or ViT-Base) in a student-teacher setup, with teacher weights updated as an exponential moving average of the student weights. Evaluation follows the dense nearest-neighbor retrieval protocol, which assesses the scene-understanding capabilities of a dense image encoder.
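A sketch of what such an augmentation pipeline could look like in torchvision is given below; the crop sizes, probabilities, and jitter strengths are illustrative placeholders rather than the paper's exact recipe.

```python
import torchvision.transforms as T

# Illustrative parameter values only; the actual training recipe may differ.
global_aug = T.Compose([
    T.RandomResizedCrop(224, scale=(0.4, 1.0)),                 # global crop (multi-crop)
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),  # color jitter
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(kernel_size=23)], p=0.5),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

local_aug = T.Compose([
    T.RandomResizedCrop(96, scale=(0.05, 0.4)),                 # local crop (multi-crop)
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
```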
NeCo's visualizations show that it retrieves more relevant and precise nearest patches than DINOv2R. A computational analysis confirms its efficiency: it requires only 2.5 GPU hours when applied on top of CrIBo, and when applied to TimeT it surpasses CrIBo with nearly 30% less total training time. The method's effectiveness is further supported by clustering and overclustering evaluations, in which the learned features assign distinct cluster IDs to detected objects and accurately trace their boundaries.
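As an illustration of the clustering and overclustering evaluation, the snippet below clusters the patch features of a single image with K-means to obtain a coarse cluster map; `patch_cluster_map` and its arguments are hypothetical helpers rather than code from the paper.

```python
import torch
from sklearn.cluster import KMeans

def patch_cluster_map(patch_feats, grid_hw, n_clusters):
    """Cluster the patch features of one image into a coarse segmentation map.

    patch_feats: (N, D) patch embeddings; grid_hw: (H_p, W_p) patch grid with
    N == H_p * W_p. Choosing n_clusters larger than the number of objects
    gives the overclustering variant.
    """
    feats = torch.nn.functional.normalize(patch_feats, dim=-1).cpu().numpy()
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
    return labels.reshape(grid_hw)  # (H_p, W_p) map of cluster IDs
```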
In conclusion, NeCo is a self-supervised learning method that improves spatial representations in DINOv2 by enforcing patch-level nearest neighbor consistency. It leverages dense post-pretraining to enhance model performance across various models and datasets, setting new state-of-the-art results in non-parametric in-context semantic segmentation and linear segmentation evaluations. NeCo's effectiveness is demonstrated through its superior performance compared to previous methods, making it a valuable contribution to the field of self-supervised learning and computer vision.
Introduction
Background
Overview of self-supervised learning methods
Importance of spatial representations in computer vision tasks
Objective
Objective of NeCo in improving spatial representations
Methodology and key components of NeCo
Method
Data Collection
Pretraining datasets (COCO, ImageNet-100)
Data augmentations (color-jitter, Gaussian blur, grayscale, multi-crop)
Data Preprocessing
Vision Transformer backbone selection (ViT-Small, ViT-Base)
Student-teacher framework setup
Teacher weights update (exponential moving average)
Training Loss
Patch-level nearest neighbor consistency enforced
Consistency between student and teacher relative to reference batches
Differentiable Sorting
Overcoming non-differentiability in nearest-neighbor retrieval
Gradient propagation through differentiable sorting
Evaluation
Datasets and Protocols
Four datasets and four evaluation protocols
Performance benchmarks against previous methods
Results
State-of-the-art performances on ADE20k, Pascal VOC, COCO-Things, COCO-Stuff
Improvements over CrIBo and DINOv2 by 4% to 13% across different metrics
Implementation
Framework
Dense post-pretraining in Python
Utilization of PyTorch and PyTorch Lightning
Computational Analysis
Efficiency of NeCo (2.5 GPU hours for CrIBo)
Comparison with TimeT (30% less training time)
Visualizations
Retrieval of relevant and precise nearest patches
Comparison with DINOv2R
Clustering and Overclustering
Unique cluster IDs assignment to detected objects
Accurate sketching of object boundaries
Conclusion
Summary of NeCo's contributions to self-supervised learning
Future directions and potential applications