Near, far: Patch-ordering enhances vision foundation models' scene understanding
- URL: http://arxiv.org/abs/2408.11054v2
- Date: Tue, 11 Feb 2025 14:15:13 GMT
- Title: Near, far: Patch-ordering enhances vision foundation models' scene understanding
- Authors: Valentinos Pariza, Mohammadreza Salehi, Gertjan Burghouts, Francesco Locatello, Yuki M. Asano
- Abstract summary: NeCo: Patch Neighbor Consistency enforces patch-level nearest neighbor consistency across a student and teacher model.
Our method leverages differentiable sorting applied on top of pretrained representations, such as DINOv2-registers, to bootstrap the learning signal.
This dense post-pretraining leads to superior performance across various models and datasets, despite requiring only 19 hours on a single GPU.
- Score: 35.768260232640756
- Abstract: We introduce NeCo: Patch Neighbor Consistency, a novel self-supervised training loss that enforces patch-level nearest neighbor consistency across a student and teacher model. Compared to contrastive approaches that only yield binary learning signals, i.e., 'attract' and 'repel', this approach benefits from the more fine-grained learning signal of sorting spatially dense features relative to reference patches. Our method leverages differentiable sorting applied on top of pretrained representations, such as DINOv2-registers, to bootstrap the learning signal and further improve upon them. This dense post-pretraining leads to superior performance across various models and datasets, despite requiring only 19 hours on a single GPU. The method generates high-quality dense feature encoders and establishes several new state-of-the-art results: +5.5% and +6% for non-parametric in-context semantic segmentation on ADE20k and Pascal VOC, +7.2% and +5.7% for linear segmentation evaluations on COCO-Things and COCO-Stuff, and improvements of more than 1.5% in multi-view consistency for 3D understanding on SPair-71k.
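The core idea of the abstract can be illustrated with a minimal sketch. The paper's loss uses differentiable sorting of patch similarities; as a simpler stand-in, the sketch below builds a temperature-scaled softmax neighbor distribution over patches for both student and teacher and penalizes their cross-entropy. Function and parameter names are illustrative, not from the paper's code.

```python
import numpy as np

def patch_neighbor_consistency_loss(student, teacher, tau=0.1):
    """Hedged sketch of a patch-level nearest-neighbor consistency loss.

    student, teacher: (num_patches, dim) L2-normalized patch features.
    For each reference patch, build a soft neighbor distribution over the
    other patches under each model, and penalize the divergence between
    them. NeCo uses differentiable sorting of these similarities; a
    softmax over similarities is used here as a simpler approximation.
    """
    def neighbor_dist(feats):
        sim = feats @ feats.T / tau            # pairwise cosine similarities
        np.fill_diagonal(sim, -np.inf)         # a patch is not its own neighbor
        sim -= sim.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(sim)
        return p / p.sum(axis=1, keepdims=True)

    p_s, p_t = neighbor_dist(student), neighbor_dist(teacher)
    # cross-entropy of the student's neighbor distribution under the teacher's
    return float(-(p_t * np.log(p_s + 1e-12)).sum(axis=1).mean())
```

The loss is minimized when the student ranks each patch's neighbors exactly as the teacher does, which is the consistency the summary describes.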
Related papers
- Pseudolabel guided pixels contrast for domain adaptive semantic segmentation [0.9831489366502301]
Unsupervised domain adaptation (UDA) for semantic segmentation is a technique that uses virtual data with labels to train a model and adapts it to real data without labels.
Some recent works use contrastive learning, which is a powerful method for self-supervised learning, to help with this technique.
We propose a novel framework called Pseudo-label Guided Pixel Contrast (PGPC), which overcomes the disadvantages of previous methods.
arXiv Detail & Related papers (2025-01-15T03:25:25Z) - No Train, all Gain: Self-Supervised Gradients Improve Deep Frozen Representations [30.9134119244757]
FUNGI is a method to enhance the features of transformer encoders by leveraging self-supervised gradients.
Our method is simple: given any pretrained model, we first compute gradients from various self-supervised objectives for each input.
The resulting features are evaluated on k-nearest neighbor classification over 11 datasets from vision, 5 from natural language processing, and 2 from audio.
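The k-nearest-neighbor evaluation mentioned above is a standard non-parametric protocol for frozen features. A minimal cosine-similarity k-NN classifier (names are illustrative, not FUNGI's actual code) might look like:

```python
import numpy as np

def knn_classify(train_feats, train_labels, query_feats, k=5):
    """Cosine-similarity k-NN classifier over frozen feature vectors."""
    tf = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    qf = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sim = qf @ tf.T                          # (queries, train) similarities
    idx = np.argsort(-sim, axis=1)[:, :k]    # top-k most similar train items
    preds = []
    for row in idx:
        votes = np.bincount(train_labels[row])  # majority vote among neighbors
        preds.append(votes.argmax())
    return np.array(preds)
```

Because no classifier is trained, accuracy under this protocol directly reflects the quality of the feature space.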
arXiv Detail & Related papers (2024-07-15T17:58:42Z) - Learning with Neighbor Consistency for Noisy Labels [69.83857578836769]
We present a method for learning from noisy labels that leverages similarities between training examples in feature space.
We evaluate our method on datasets with both synthetic (CIFAR-10, CIFAR-100) and realistic (mini-WebVision, Clothing1M, mini-ImageNet-Red) label noise.
arXiv Detail & Related papers (2022-02-04T15:46:27Z) - With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations [87.72779294717267]
Using the nearest neighbor as a positive in contrastive losses significantly improves performance on ImageNet classification.
We demonstrate empirically that our method is less reliant on complex data augmentations.
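The nearest-neighbor-as-positive idea above (NNCLR-style) can be sketched as follows: each embedding's positive in an InfoNCE loss is replaced by its nearest neighbor from a support queue of past embeddings. This is a hedged illustration, not the paper's implementation.

```python
import numpy as np

def nn_contrastive_loss(z1, z2, queue, tau=0.1):
    """Sketch of a nearest-neighbor contrastive (NNCLR-style) loss.

    z1, z2: (batch, dim) L2-normalized embeddings of two augmented views.
    queue: (q, dim) L2-normalized support set of past embeddings.
    Each z1 is swapped for its nearest neighbor in the queue, which then
    serves as the positive for the matching z2 in an InfoNCE loss.
    """
    nn_idx = (z1 @ queue.T).argmax(axis=1)  # nearest support embedding per sample
    positives = queue[nn_idx]               # (batch, dim)
    logits = positives @ z2.T / tau         # (batch, batch) similarity logits
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    diag = np.arange(len(z1))               # positive pairs sit on the diagonal
    return float(-log_prob[diag, diag].mean())
```

Sampling the positive from the queue, rather than from an augmentation of the same image, is what reduces the reliance on complex data augmentations.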
arXiv Detail & Related papers (2021-04-29T17:56:08Z) - Jigsaw Clustering for Unsupervised Visual Representation Learning [68.09280490213399]
We propose a new jigsaw clustering pretext task in this paper.
Our method makes use of information from both intra- and inter-images.
It is comparable to contrastive learning methods even when only half of the training batches are used.
arXiv Detail & Related papers (2021-04-01T08:09:26Z) - CoMatch: Semi-supervised Learning with Contrastive Graph Regularization [86.84486065798735]
CoMatch is a new semi-supervised learning method that unifies dominant approaches.
It achieves state-of-the-art performance on multiple datasets.
arXiv Detail & Related papers (2020-11-23T02:54:57Z) - Dense Contrastive Learning for Self-Supervised Visual Pre-Training [102.15325936477362]
We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only 1% slower).
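The per-pixel loss described above can be sketched as an InfoNCE loss over dense feature grids. Here each location in one view is paired with its most similar location in the other view, a simplified stand-in for the paper's correspondence rule; all names are illustrative.

```python
import numpy as np

def dense_contrastive_loss(f1, f2, tau=0.2):
    """Sketch of a dense (per-location) contrastive loss between two views.

    f1, f2: (locations, dim) L2-normalized dense feature grids from two
    augmented views of the same image. Each view-1 location is matched to
    its most similar view-2 location (the positive); all other view-2
    locations serve as negatives.
    """
    sim = f1 @ f2.T / tau                        # (locations, locations) logits
    pos = sim.argmax(axis=1)                     # matched location per row
    sim -= sim.max(axis=1, keepdims=True)        # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(f1)), pos].mean())
```

Because the loss only adds one matrix product over features the backbone already computes, the extra cost is small, consistent with the overhead the summary reports.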
arXiv Detail & Related papers (2020-11-18T08:42:32Z) - Contrastive Multi-View Representation Learning on Graphs [13.401746329218017]
We introduce a self-supervised approach for learning node and graph level representations by contrasting structural views of graphs.
We achieve new state-of-the-art results in self-supervised learning on 8 out of 8 node and graph classification benchmarks.
arXiv Detail & Related papers (2020-06-10T00:49:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.