Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis
- URL: http://arxiv.org/abs/2503.15060v3
- Date: Tue, 20 May 2025 08:24:32 GMT
- Title: Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis
- Authors: Imanol G. Estepa, Jesús M. Rodríguez-de-Vera, Ignacio Sarasúa, Bhalaji Nagarajan, Petia Radeva
- Abstract summary: Unified Self-Supervised Learning (SSL) methods have started to bridge the gap between representation learning and generative modeling. Recent Unified SSL methods rely solely on semantic token reconstruction, which requires an external tokenizer during training. We introduce Sorcen, a novel unified SSL framework, incorporating a synergic Contrastive-Reconstruction objective.
- Score: 3.5900418884504095
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: While representation learning and generative modeling seek to understand visual data, unifying both domains remains unexplored. Recent Unified Self-Supervised Learning (SSL) methods have started to bridge the gap between both paradigms. However, they rely solely on semantic token reconstruction, which requires an external tokenizer during training -- introducing a significant overhead. In this work, we introduce Sorcen, a novel unified SSL framework, incorporating a synergic Contrastive-Reconstruction objective. Our Contrastive objective, "Echo Contrast", leverages the generative capabilities of Sorcen, eliminating the need for additional image crops or augmentations during training. Sorcen "generates" an echo sample in the semantic token space, forming the contrastive positive pair. Sorcen operates exclusively on precomputed tokens, eliminating the need for an online token transformation during training, thereby significantly reducing computational overhead. Extensive experiments on ImageNet-1k demonstrate that Sorcen outperforms the previous Unified SSL SoTA by 0.4%, 1.48 FID, 1.76%, and 1.53% on linear probing, unconditional image generation, few-shot learning, and transfer learning, respectively, while being 60.8% more efficient. Additionally, Sorcen surpasses previous single-crop MIM SoTA in linear probing and achieves SoTA performance in unconditional image generation, highlighting significant improvements and breakthroughs in Unified SSL models.
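The "Echo Contrast" idea above (a generated echo in semantic token space serving as the contrastive positive) can be illustrated with a minimal InfoNCE sketch. Everything here is a toy stand-in, not the paper's implementation: the echo is simulated as a small perturbation of the anchor tokens, since the actual generator is not reproduced, and all names and dimensions are illustrative.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE: each anchor's positive sits on the diagonal of the
    similarity matrix; the other batch entries act as negatives."""
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature                      # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(a))
    return -log_probs[idx, idx].mean()

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 64))                       # anchor token embeddings (toy)
echoes = tokens + 0.05 * rng.normal(size=tokens.shape)  # stand-in "echo" positives
loss_aligned = info_nce(tokens, echoes)
loss_random = info_nce(tokens, rng.normal(size=tokens.shape))
# Aligned echoes should yield a much lower contrastive loss than random pairs.
```

Note that no image crops or augmentations appear anywhere in the sketch: the positive pair is formed entirely in the (precomputed) token space, which is the efficiency argument the abstract makes.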
Related papers
- Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think [56.539823627694304]
REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. We propose Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising.
arXiv Detail & Related papers (2025-07-02T08:29:18Z) - A Contrastive Learning Foundation Model Based on Perfectly Aligned Sample Pairs for Remote Sensing Images [18.191222010916405]
We present a novel self-supervised method called PerA, which produces all-purpose Remote Sensing features through semantically Perfectly Aligned sample pairs. Our framework provides high-quality features by ensuring consistency between teacher and student. We collect an unlabeled pre-training dataset, which contains about 5 million RS images.
arXiv Detail & Related papers (2025-05-26T03:12:49Z) - Cross-Modal Consistency Learning for Sign Language Recognition [90.76641997060513]
We propose a Cross-modal Consistency Learning framework (CCL-SLR) for Isolated Sign Language Recognition. CCL-SLR learns from both RGB and pose modalities based on self-supervised pre-training. It achieves impressive performance, demonstrating its effectiveness.
arXiv Detail & Related papers (2025-03-16T12:34:07Z) - Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards [52.90573877727541]
Reinforcement learning (RL) has been considered for diffusion model fine-tuning.
RL's effectiveness is limited by the challenge of sparse rewards.
B$^2$-DiffuRL is compatible with existing optimization algorithms.
arXiv Detail & Related papers (2025-03-14T09:45:19Z) - Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think [72.48325960659822]
One main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs.
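The shape of a REPA-style regularizer (negative cosine similarity between projected denoiser hidden states and clean encoder features) can be sketched as below. This is a toy illustration under stated assumptions: the projection head, hidden states, and target features are synthetic placeholders, not actual DiT/SiT internals or the paper's exact loss.

```python
import numpy as np

def repa_alignment_loss(hidden, target, proj):
    """Negative cosine similarity between projected denoiser hidden states
    and frozen-encoder features (a REPA-style regularizer sketch)."""
    h = hidden @ proj                                      # project to target dim
    h = h / np.linalg.norm(h, axis=-1, keepdims=True)
    t = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return -np.mean(np.sum(h * t, axis=-1))

rng = np.random.default_rng(1)
hidden = rng.normal(size=(4, 32))   # hidden states from a noisy input (toy)
proj = rng.normal(size=(32, 16))    # stand-in for a trainable projection head
target = hidden @ proj              # pretend the encoder features align perfectly
perfect = repa_alignment_loss(hidden, target, proj)          # -> -1.0
mismatch = repa_alignment_loss(hidden, rng.normal(size=(4, 16)), proj)
```

In training, a term like this would be added to the denoising objective with a weighting coefficient; minimizing it pulls the denoiser's internal representations toward those of the external encoder.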
arXiv Detail & Related papers (2024-10-09T14:34:53Z) - Contrastive Learning with Synthetic Positives [11.932323457691945]
Contrastive learning with the nearest neighbor has proved to be one of the most efficient self-supervised learning (SSL) techniques.
In this paper, we introduce a novel approach called Contrastive Learning with Synthetic Positives (NCLP).
NCLP utilizes synthetic images, generated by an unconditional diffusion model, as the additional positives to help the model learn from diverse positives.
arXiv Detail & Related papers (2024-08-30T01:47:43Z) - LeOCLR: Leveraging Original Images for Contrastive Learning of Visual Representations [4.680881326162484]
Contrastive instance discrimination methods outperform supervised learning in downstream tasks such as image classification and object detection.
A common augmentation technique in contrastive learning is random cropping followed by resizing.
We introduce LeOCLR, a framework that employs a novel instance discrimination approach and an adapted loss function.
arXiv Detail & Related papers (2024-03-11T15:33:32Z) - Hallucination Improves the Performance of Unsupervised Visual Representation Learning [9.504503675097137]
We propose Hallucinator, which efficiently generates additional positive samples for further contrast.
The Hallucinator is differentiable and creates new data in the feature space.
Remarkably, we empirically prove that the proposed Hallucinator generalizes well to various contrastive learning models.
arXiv Detail & Related papers (2023-07-22T21:15:56Z) - MSR: Making Self-supervised learning Robust to Aggressive Augmentations [98.6457801252358]
We propose a new SSL paradigm, which counteracts the impact of semantic shift by balancing the role of weak and aggressively augmented pairs.
We show that our model achieves 73.1% top-1 accuracy on ImageNet-1K with ResNet-50 for 200 epochs, which is a 2.5% improvement over BYOL.
arXiv Detail & Related papers (2022-06-04T14:27:29Z) - Crafting Better Contrastive Views for Siamese Representation Learning [20.552194081238248]
We propose ContrastiveCrop, which effectively generates better crops for Siamese representation learning.
A semantic-aware object localization strategy is proposed within the training process in a fully unsupervised manner.
As a plug-and-play and framework-agnostic module, ContrastiveCrop consistently improves SimCLR, MoCo, BYOL, and SimSiam by 0.4%-2.0% in classification accuracy.
arXiv Detail & Related papers (2022-02-07T15:09:00Z) - QK Iteration: A Self-Supervised Representation Learning Algorithm for Image Similarity [0.0]
We present a new contrastive self-supervised representation learning algorithm in the context of Copy Detection in the 2021 Image Similarity Challenge hosted by Facebook AI Research.
Our algorithms achieved a micro-AP score of 0.3401 on the Phase 1 leaderboard, significantly improving over the baseline $\mu$AP of 0.1556.
arXiv Detail & Related papers (2021-11-15T18:01:05Z) - Seed the Views: Hierarchical Semantic Alignment for Contrastive Representation Learning [116.91819311885166]
We propose a hierarchical semantic alignment strategy via expanding the views generated by a single image to Cross-samples and Multi-level representation.
Our method, termed as CsMl, has the ability to integrate multi-level visual representations across samples in a robust way.
arXiv Detail & Related papers (2020-12-04T17:26:24Z) - Whitening for Self-Supervised Representation Learning [129.57407186848917]
We propose a new loss function for self-supervised representation learning (SSL) based on the whitening of latent-space features.
Our solution does not require asymmetric networks and it is conceptually simple.
arXiv Detail & Related papers (2020-07-13T12:33:25Z) - Un-Mix: Rethinking Image Mixtures for Unsupervised Visual Representation Learning [108.999497144296]
Recently advanced unsupervised learning approaches use a Siamese-like framework to compare two "views" from the same image for learning representations.
This work aims to involve the distance concept on label space in the unsupervised learning and let the model be aware of the soft degree of similarity between positive or negative pairs.
Despite its conceptual simplicity, we show empirically that the proposed solution, Unsupervised image mixtures (Un-Mix), learns subtler, more robust, and more generalized representations from the transformed input and the corresponding new label space.
arXiv Detail & Related papers (2020-03-11T17:59:04Z)
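The image-mixture idea in the Un-Mix entry above can be sketched as a plain convex mixup of two views with a Beta-sampled coefficient. This is a toy illustration under stated assumptions (function name, shapes, and Beta parameters are hypothetical), and the soft label-space weighting that uses the same coefficient is only noted in a comment, not reproduced.

```python
import numpy as np

def unmix_pair(x1, x2, lam):
    """Convex mixture of two views; lam also defines the soft degree of
    similarity used on the label side (not reproduced here)."""
    return lam * x1 + (1.0 - lam) * x2

rng = np.random.default_rng(3)
view_a = rng.normal(size=(2, 3, 4, 4))  # (batch, C, H, W) toy "images"
view_b = rng.normal(size=(2, 3, 4, 4))
lam = float(rng.beta(1.0, 1.0))         # mixture coefficient sampled per batch
mixed = unmix_pair(view_a, view_b, lam)
```

The key design point the abstract describes is that lam is not discarded after mixing: it becomes a soft similarity target, so the model is told *how* positive a mixed pair is rather than treating similarity as binary.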
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.