Semantic Concentration for Self-Supervised Dense Representations Learning
- URL: http://arxiv.org/abs/2509.09429v1
- Date: Thu, 11 Sep 2025 13:12:10 GMT
- Title: Semantic Concentration for Self-Supervised Dense Representations Learning
- Authors: Peisong Wen, Qianqian Xu, Siran Dai, Runmin Cong, Qingming Huang
- Abstract summary: Image-level self-supervised learning (SSL) has made significant progress, yet learning dense representations for patches remains challenging. This work reveals that image-level SSL avoids over-dispersion by involving implicit semantic concentration.
- Score: 103.10708947415092
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image-level self-supervised learning (SSL) has made significant progress, yet learning dense representations for patches remains challenging. Mainstream methods encounter an over-dispersion phenomenon in which patches from the same instance/category scatter, harming downstream performance on dense tasks. This work reveals that image-level SSL avoids over-dispersion by involving implicit semantic concentration. Specifically, the non-strict spatial alignment ensures intra-instance consistency, while shared patterns, i.e., similar parts of within-class instances in the input space, ensure inter-image consistency. Unfortunately, these mechanisms are infeasible for dense SSL due to its spatial sensitivity and complicated scene-centric data. These observations motivate us to explore explicit semantic concentration for dense SSL. First, to break the strict spatial alignment, we propose to distill the patch correspondences. Facing noisy and imbalanced pseudo labels, we propose a noise-tolerant ranking loss. The core idea is extending the Average Precision (AP) loss to continuous targets, such that its decision-agnostic and adaptive focusing properties prevent the student model from being misled. Second, to discriminate the shared patterns from complicated scenes, we propose an object-aware filter that maps the output space to an object-based space. Specifically, patches are represented by learnable prototypes of objects via cross-attention. Finally, empirical studies across various tasks support the effectiveness of our method. Code is available at https://github.com/KID-7391/CoTAP.
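The object-aware filter described above can be illustrated with a minimal sketch: patch features cross-attend to a small set of learnable object prototypes, so that each patch is re-expressed as a mixture of prototypes rather than a free point in feature space. This is an illustrative NumPy sketch under assumed shapes (16 patches, 64-dim features, 8 prototypes); the function and variable names are hypothetical, and the authors' actual implementation is in the linked CoTAP repository.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def object_aware_filter(patches, prototypes):
    """Map patch features into an object-based space via cross-attention.

    patches:    (N, D) patch features from the backbone
    prototypes: (K, D) learnable object prototypes (randomly initialized here)
    Returns:    (N, D) patches re-expressed as convex combinations of prototypes.
    """
    d = patches.shape[-1]
    # Scaled dot-product attention from patches (queries) to prototypes (keys/values).
    attn = softmax(patches @ prototypes.T / np.sqrt(d), axis=-1)  # (N, K)
    return attn @ prototypes

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 64))    # 16 patches, 64-dim features (assumed)
prototypes = rng.normal(size=(8, 64))  # 8 object prototypes (assumed)
out = object_aware_filter(patches, prototypes)
print(out.shape)  # (16, 64)
```

In a full model the prototypes would be trainable parameters and the attention would use learned query/key/value projections; the sketch keeps only the core idea that the output space is spanned by a small prototype set.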
Related papers
- Self-Supervised Learning from Structural Invariance [6.07374214141791]
We study the one-to-many mapping problem in joint-embedding self-supervised learning (SSL). We show that existing methods struggle to flexibly capture this conditional uncertainty. We empirically show its versatility in causal representation learning, fine-grained image understanding, and world modeling on videos.
arXiv Detail & Related papers (2026-02-02T17:44:44Z) - Shortcut Invariance: Targeted Jacobian Regularization in Disentangled Latent Space [7.8904984750896885]
Deep neural networks are prone to learning shortcuts: spurious and easily learned correlations. We present a simple and effective training method that renders the classifier functionally invariant to shortcut signals. We analyze this as targeted Jacobian regularization, which forces the classifier to ignore spurious features and rely on more complex, core semantic signals.
arXiv Detail & Related papers (2025-11-24T07:09:08Z) - A Contrastive Learning Foundation Model Based on Perfectly Aligned Sample Pairs for Remote Sensing Images [18.191222010916405]
We present a novel self-supervised method called PerA, which produces all-purpose Remote Sensing features through semantically Perfectly Aligned sample pairs. Our framework provides high-quality features by ensuring consistency between teacher and student. We collect an unlabeled pre-training dataset, which contains about 5 million RS images.
arXiv Detail & Related papers (2025-05-26T03:12:49Z) - Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas [52.478956204238315]
We study the spatial reasoning challenge from the lens of mechanistic interpretability. We observe that successful spatial reasoning correlates strongly with the model's ability to align its attention with actual object locations. Motivated by these findings, we propose ADAPTVIS to sharpen the attention on highly relevant regions when confident.
arXiv Detail & Related papers (2025-03-03T17:57:03Z) - SEER-ZSL: Semantic Encoder-Enhanced Representations for Generalized Zero-Shot Learning [0.6792605600335813]
Zero-Shot Learning (ZSL) presents the challenge of identifying categories not seen during training. We introduce Semantic-Enhanced Representations for Zero-Shot Learning (SEER-ZSL). First, we aim to distill meaningful semantic information using a probabilistic encoder, enhancing the semantic consistency and robustness. Second, we distill the visual space by exploiting the learned data distribution through an adversarially trained generator. Third, we align the distilled information, enabling a mapping of unseen categories onto the true data manifold.
arXiv Detail & Related papers (2023-12-20T15:18:51Z) - De-coupling and De-positioning Dense Self-supervised Learning [65.56679416475943]
Dense Self-Supervised Learning (SSL) methods address the limitations of using image-level feature representations when handling images with multiple objects.
We show that they suffer from coupling and positional bias, which arise from the receptive field increasing with layer depth and zero-padding.
We demonstrate the benefits of our method on COCO and on a new challenging benchmark, OpenImage-MINI, for object classification, semantic segmentation, and object detection.
arXiv Detail & Related papers (2023-03-29T18:07:25Z) - Boosting Few-shot Fine-grained Recognition with Background Suppression and Foreground Alignment [53.401889855278704]
Few-shot fine-grained recognition (FS-FGR) aims to recognize novel fine-grained categories with the help of limited available samples.
We propose a two-stage background suppression and foreground alignment framework, which is composed of a background activation suppression (BAS) module, a foreground object alignment (FOA) module, and a local to local (L2L) similarity metric.
Experiments conducted on multiple popular fine-grained benchmarks demonstrate that our method outperforms the existing state-of-the-art by a large margin.
arXiv Detail & Related papers (2022-10-04T07:54:40Z) - Constrained Mean Shift for Representation Learning [17.652439157554877]
We develop a non-contrastive representation learning method that can exploit additional knowledge.
Our main idea is to generalize the mean-shift algorithm by constraining the search space of nearest neighbors.
We show that it is possible to use the noisy constraint across modalities to train self-supervised video models.
arXiv Detail & Related papers (2021-10-19T23:14:23Z) - Information Bottleneck Constrained Latent Bidirectional Embedding for Zero-Shot Learning [59.58381904522967]
We propose a novel embedding based generative model with a tight visual-semantic coupling constraint.
We learn a unified latent space that calibrates the embedded parametric distributions of both visual and semantic spaces.
Our method can be easily extended to transductive ZSL setting by generating labels for unseen images.
arXiv Detail & Related papers (2020-09-16T03:54:12Z) - Density-Aware Graph for Deep Semi-Supervised Visual Recognition [102.9484812869054]
Semi-supervised learning (SSL) has been extensively studied to improve the generalization ability of deep neural networks for visual recognition.
This paper proposes to solve the SSL problem by building a novel density-aware graph, based on which the neighborhood information can be easily leveraged.
arXiv Detail & Related papers (2020-03-30T02:52:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.