Near, far: Patch-ordering enhances vision foundation models' scene understanding
- URL: http://arxiv.org/abs/2408.11054v2
- Date: Tue, 11 Feb 2025 14:15:13 GMT
- Title: Near, far: Patch-ordering enhances vision foundation models' scene understanding
- Authors: Valentinos Pariza, Mohammadreza Salehi, Gertjan Burghouts, Francesco Locatello, Yuki M. Asano
- Abstract summary: NeCo: Patch Neighbor Consistency enforces patch-level nearest neighbor consistency across a student and teacher model.
Our method leverages differentiable sorting applied on top of pretrained representations, such as DINOv2-registers, to bootstrap the learning signal.
This dense post-pretraining leads to superior performance across various models and datasets, despite requiring only 19 hours on a single GPU.
- Score: 35.768260232640756
- Abstract: We introduce NeCo: Patch Neighbor Consistency, a novel self-supervised training loss that enforces patch-level nearest neighbor consistency across a student and teacher model. Compared to contrastive approaches that only yield binary learning signals, i.e., 'attract' and 'repel', this approach benefits from the more fine-grained learning signal of sorting spatially dense features relative to reference patches. Our method leverages differentiable sorting applied on top of pretrained representations, such as DINOv2-registers, to bootstrap the learning signal and further improve upon them. This dense post-pretraining leads to superior performance across various models and datasets, despite requiring only 19 hours on a single GPU. The method generates high-quality dense feature encoders and establishes several new state-of-the-art results: +5.5% and +6% for non-parametric in-context semantic segmentation on ADE20k and Pascal VOC, +7.2% and +5.7% for linear segmentation evaluations on COCO-Things and COCO-Stuff, and improvements of more than 1.5% in multi-view consistency for 3D understanding on SPair-71k.
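The core idea of the abstract can be illustrated with a minimal sketch. The paper's loss uses differentiable sorting of patch similarities; as a simpler stand-in, the sketch below builds a temperature-scaled softmax neighbor distribution over patches for both student and teacher and penalizes their cross-entropy. Function and parameter names are illustrative, not from the paper's code.

```python
import numpy as np

def patch_neighbor_consistency_loss(student, teacher, tau=0.1):
    """Hedged sketch of a patch-level nearest-neighbor consistency loss.

    student, teacher: (num_patches, dim) L2-normalized patch features.
    For each reference patch, build a soft neighbor distribution over the
    other patches under each model, and penalize the divergence between
    them. NeCo uses differentiable sorting of these similarities; a
    softmax over similarities is used here as a simpler approximation.
    """
    def neighbor_dist(feats):
        sim = feats @ feats.T / tau            # pairwise cosine similarities
        np.fill_diagonal(sim, -np.inf)         # a patch is not its own neighbor
        sim -= sim.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(sim)
        return p / p.sum(axis=1, keepdims=True)

    p_s, p_t = neighbor_dist(student), neighbor_dist(teacher)
    # cross-entropy of the student's neighbor distribution under the teacher's
    return float(-(p_t * np.log(p_s + 1e-12)).sum(axis=1).mean())
```

The loss is minimized when the student ranks each patch's neighbors exactly as the teacher does, which is the consistency the summary describes.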
Related papers
- Pseudolabel guided pixels contrast for domain adaptive semantic segmentation [0.9831489366502301]
Unsupervised domain adaptation (UDA) for semantic segmentation is a technique that uses virtual data with labels to train a model and adapts it to real data without labels.
Some recent works use contrastive learning, which is a powerful method for self-supervised learning, to help with this technique.
We propose a novel framework called Pseudo-label Guided Pixel Contrast (PGPC), which overcomes the disadvantages of previous methods.
arXiv Detail & Related papers (2025-01-15T03:25:25Z) - No Train, all Gain: Self-Supervised Gradients Improve Deep Frozen Representations [30.9134119244757]
FUNGI is a method to enhance the features of transformer encoders by leveraging self-supervised gradients.
Our method is simple: given any pretrained model, we first compute gradients from various self-supervised objectives for each input.
The resulting features are evaluated on k-nearest neighbor classification over 11 datasets from vision, 5 from natural language processing, and 2 from audio.
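The k-nearest-neighbor evaluation mentioned above is a standard non-parametric protocol for frozen features. A minimal cosine-similarity k-NN classifier (names are illustrative, not FUNGI's actual code) might look like:

```python
import numpy as np

def knn_classify(train_feats, train_labels, query_feats, k=5):
    """Cosine-similarity k-NN classifier over frozen feature vectors."""
    tf = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    qf = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sim = qf @ tf.T                          # (queries, train) similarities
    idx = np.argsort(-sim, axis=1)[:, :k]    # top-k most similar train items
    preds = []
    for row in idx:
        votes = np.bincount(train_labels[row])  # majority vote among neighbors
        preds.append(votes.argmax())
    return np.array(preds)
```

Because no classifier is trained, accuracy under this protocol directly reflects the quality of the feature space.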
arXiv Detail & Related papers (2024-07-15T17:58:42Z) - Learning with Neighbor Consistency for Noisy Labels [69.83857578836769]
We present a method for learning from noisy labels that leverages similarities between training examples in feature space.
We evaluate our method on datasets with both synthetic (CIFAR-10, CIFAR-100) and realistic (mini-WebVision, Clothing1M, mini-ImageNet-Red) label noise.
arXiv Detail & Related papers (2022-02-04T15:46:27Z) - With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations [87.72779294717267]
Using the nearest neighbor as a positive in contrastive losses significantly improves performance on ImageNet classification.
We demonstrate empirically that our method is less reliant on complex data augmentations.
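The nearest-neighbor-as-positive idea above (NNCLR-style) can be sketched as follows: each embedding's positive in an InfoNCE loss is replaced by its nearest neighbor from a support queue of past embeddings. This is a hedged illustration, not the paper's implementation.

```python
import numpy as np

def nn_contrastive_loss(z1, z2, queue, tau=0.1):
    """Sketch of a nearest-neighbor contrastive (NNCLR-style) loss.

    z1, z2: (batch, dim) L2-normalized embeddings of two augmented views.
    queue: (q, dim) L2-normalized support set of past embeddings.
    Each z1 is swapped for its nearest neighbor in the queue, which then
    serves as the positive for the matching z2 in an InfoNCE loss.
    """
    nn_idx = (z1 @ queue.T).argmax(axis=1)  # nearest support embedding per sample
    positives = queue[nn_idx]               # (batch, dim)
    logits = positives @ z2.T / tau         # (batch, batch) similarity logits
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    diag = np.arange(len(z1))               # positive pairs sit on the diagonal
    return float(-log_prob[diag, diag].mean())
```

Sampling the positive from the queue, rather than from an augmentation of the same image, is what reduces the reliance on complex data augmentations.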
arXiv Detail & Related papers (2021-04-29T17:56:08Z) - Jigsaw Clustering for Unsupervised Visual Representation Learning [68.09280490213399]
We propose a new jigsaw clustering pretext task in this paper.
Our method makes use of information from both intra- and inter-images.
It is comparable to contrastive learning methods even when only half of the training batches are used.
arXiv Detail & Related papers (2021-04-01T08:09:26Z) - CoMatch: Semi-supervised Learning with Contrastive Graph Regularization [86.84486065798735]
CoMatch is a new semi-supervised learning method that unifies dominant approaches.
It achieves state-of-the-art performance on multiple datasets.
arXiv Detail & Related papers (2020-11-23T02:54:57Z) - Dense Contrastive Learning for Self-Supervised Visual Pre-Training [102.15325936477362]
We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only 1% slower).
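The per-pixel loss described above can be sketched as an InfoNCE loss over dense feature grids. Here each location in one view is paired with its most similar location in the other view, a simplified stand-in for the paper's correspondence rule; all names are illustrative.

```python
import numpy as np

def dense_contrastive_loss(f1, f2, tau=0.2):
    """Sketch of a dense (per-location) contrastive loss between two views.

    f1, f2: (locations, dim) L2-normalized dense feature grids from two
    augmented views of the same image. Each view-1 location is matched to
    its most similar view-2 location (the positive); all other view-2
    locations serve as negatives.
    """
    sim = f1 @ f2.T / tau                        # (locations, locations) logits
    pos = sim.argmax(axis=1)                     # matched location per row
    sim -= sim.max(axis=1, keepdims=True)        # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(f1)), pos].mean())
```

Because the loss only adds one matrix product over features the backbone already computes, the extra cost is small, consistent with the overhead the summary reports.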
arXiv Detail & Related papers (2020-11-18T08:42:32Z) - Contrastive Multi-View Representation Learning on Graphs [13.401746329218017]
We introduce a self-supervised approach for learning node and graph level representations by contrasting structural views of graphs.
We achieve new state-of-the-art results in self-supervised learning on 8 out of 8 node and graph classification benchmarks.
arXiv Detail & Related papers (2020-06-10T00:49:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.