Information-Maximized Soft Variable Discretization for Self-Supervised Image Representation Learning
- URL: http://arxiv.org/abs/2501.03469v1
- Date: Tue, 07 Jan 2025 02:10:52 GMT
- Title: Information-Maximized Soft Variable Discretization for Self-Supervised Image Representation Learning
- Authors: Chuang Niu, Wenjun Xia, Hongming Shan, Ge Wang,
- Abstract summary: IMSVD softly discretizes each variable in the latent space.
We propose an information-theoretic objective function to learn transform-invariant, non-travail, and redundancy-minimized representation features.
- Score: 20.530066199565745
- License:
- Abstract: Self-supervised learning (SSL) has emerged as a crucial technique in image processing, encoding, and understanding, especially for developing today's vision foundation models that utilize large-scale datasets without annotations to enhance various downstream tasks. This study introduces a novel SSL approach, Information-Maximized Soft Variable Discretization (IMSVD), for image representation learning. Specifically, IMSVD softly discretizes each variable in the latent space, enabling the estimation of their probability distributions over training batches and allowing the learning process to be directly guided by information measures. Motivated by the MultiView assumption, we propose an information-theoretic objective function to learn transform-invariant, non-travail, and redundancy-minimized representation features. We then derive a joint-cross entropy loss function for self-supervised image representation learning, which theoretically enjoys superiority over the existing methods in reducing feature redundancy. Notably, our non-contrastive IMSVD method statistically performs contrastive learning. Extensive experimental results demonstrate the effectiveness of IMSVD on various downstream tasks in terms of both accuracy and efficiency. Thanks to our variable discretization, the embedding features optimized by IMSVD offer unique explainability at the variable level. IMSVD has the potential to be adapted to other learning paradigms. Our code is publicly available at https://github.com/niuchuangnn/IMSVD.
Related papers
- Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent [72.1517476116743]
Recent MLLMs have shown emerging visual understanding and reasoning abilities after being pre-trained on large-scale multimodal datasets.
Existing approaches, such as direct fine-tuning and continual learning methods, fail to explicitly address this issue.
We introduce a novel perspective leveraging effective rank to quantify the degradation of visual representation forgetting.
We propose a modality-decoupled gradient descent (MDGD) method that regulates gradient updates to maintain the effective rank of visual representations.
arXiv Detail & Related papers (2025-02-17T12:26:34Z) - From Visuals to Vocabulary: Establishing Equivalence Between Image and Text Token Through Autoregressive Pre-training in MLLMs [23.011836329934255]
Vision Dynamic Embedding-Guided Pretraining (VDEP) is a hybrid autoregressive training paradigm for MLLMs.
The proposed method seamlessly integrates into standard models without architectural changes.
Experiments on 13 benchmarks show VDEP outperforms baselines, surpassing existing methods.
arXiv Detail & Related papers (2025-02-13T09:04:28Z) - Multi-Head Attention Driven Dynamic Visual-Semantic Embedding for Enhanced Image-Text Matching [0.8611782340880084]
This study proposes an innovative visual semantic embedding model, Multi-Headed Consensus-Aware Visual-Semantic Embedding (MH-CVSE)
This model introduces a multi-head self-attention mechanism based on the consensus-aware visual semantic embedding model (CVSE) to capture information in multiple subspaces in parallel.
In terms of loss function design, the MH-CVSE model adopts a dynamic weight adjustment strategy to dynamically adjust the weight according to the loss value itself.
arXiv Detail & Related papers (2024-12-26T11:46:22Z) - Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior to the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - The Common Stability Mechanism behind most Self-Supervised Learning
Approaches [64.40701218561921]
We provide a framework to explain the stability mechanism of different self-supervised learning techniques.
We discuss the working mechanism of contrastive techniques like SimCLR, non-contrastive techniques like BYOL, SWAV, SimSiam, Barlow Twins, and DINO.
We formulate different hypotheses and test them using the Imagenet100 dataset.
arXiv Detail & Related papers (2024-02-22T20:36:24Z) - USIM-DAL: Uncertainty-aware Statistical Image Modeling-based Dense
Active Learning for Super-resolution [47.38982697349244]
Dense regression is a widely used approach in computer vision for tasks such as image super-resolution, enhancement, depth estimation, etc.
We propose incorporating active learning into dense regression models to address this problem.
Active learning allows models to select the most informative samples for labeling, reducing the overall annotation cost while improving performance.
arXiv Detail & Related papers (2023-05-27T16:33:43Z) - ArCL: Enhancing Contrastive Learning with Augmentation-Robust
Representations [30.745749133759304]
We develop a theoretical framework to analyze the transferability of self-supervised contrastive learning.
We show that contrastive learning fails to learn domain-invariant features, which limits its transferability.
Based on these theoretical insights, we propose a novel method called Augmentation-robust Contrastive Learning (ArCL)
arXiv Detail & Related papers (2023-03-02T09:26:20Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - Visual Adversarial Imitation Learning using Variational Models [60.69745540036375]
Reward function specification remains a major impediment for learning behaviors through deep reinforcement learning.
Visual demonstrations of desired behaviors often presents an easier and more natural way to teach agents.
We develop a variational model-based adversarial imitation learning algorithm.
arXiv Detail & Related papers (2021-07-16T00:15:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.