Related papers: Gentle-CLIP: Exploring Aligned Semantic In Low-Quality Multimodal Data With Soft Alignment

Gentle-CLIP: Exploring Aligned Semantic In Low-Quality Multimodal Data With Soft Alignment

URL: http://arxiv.org/abs/2406.05766v1
Date: Sun, 9 Jun 2024 12:41:14 GMT
Title: Gentle-CLIP: Exploring Aligned Semantic In Low-Quality Multimodal Data With Soft Alignment
Authors: Zijia Song, Zelin Zang, Yelin Wang, Guozheng Yang, Jiangbin Zheng, Kaicheng yu, Wanyu Chen, Stan Z. Li,
Abstract summary: Multimodal fusion breaks through the barriers between diverse modalities and has already yielded numerous impressive performances. In various specialized fields, it is struggling to obtain sufficient alignment data for the training process. We transform semi-supervised multimodal alignment into a manifold matching problem and propose a new method based on CLIP, named Gentle-CLIP.
Score: 38.38565939595014
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal fusion breaks through the barriers between diverse modalities and has already yielded numerous impressive performances. However, in various specialized fields, it is struggling to obtain sufficient alignment data for the training process, which seriously limits the use of previously elegant models. Thus, semi-supervised learning attempts to achieve multimodal alignment with fewer matched pairs but traditional methods like pseudo-labeling are difficult to apply in domains with no label information. To address these problems, we transform semi-supervised multimodal alignment into a manifold matching problem and propose a new method based on CLIP, named Gentle-CLIP. Specifically, we design a novel semantic density distribution loss to explore implicit semantic alignment information from unpaired multimodal data by constraining the latent representation distribution with fine granularity, thus eliminating the need for numerous strictly matched pairs. Meanwhile, we introduce multi-kernel maximum mean discrepancy as well as self-supervised contrastive loss to pull separate modality distributions closer and enhance the stability of the representation distribution. In addition, the contrastive loss used in CLIP is employed on the supervised matched data to prevent negative optimization. Extensive experiments conducted on a range of tasks in various fields, including protein, remote sensing, and the general vision-language field, demonstrate the effectiveness of our proposed Gentle-CLIP.

Related papers

Principled Multimodal Representation Learning [70.60542106731813]
Multimodal representation learning seeks to create a unified representation space by integrating diverse data modalities.<n>Recent advances have investigated the simultaneous alignment of multiple modalities, yet several challenges remain.<n>We propose Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous alignment of multiple modalities.
arXiv Detail & Related papers (2025-07-23T09:12:25Z)
Continual Multimodal Contrastive Learning [70.60542106731813]
Multimodal contrastive learning (MCL) advances in aligning different modalities and generating multimodal representations in a joint space. However, a critical yet often overlooked challenge remains: multimodal data is rarely collected in a single process, and training from scratch is computationally expensive. In this paper, we formulate CMCL through two specialized principles of stability and plasticity. We theoretically derive a novel optimization-based method, which projects updated gradients from dual sides onto subspaces where any gradient is prevented from interfering with the previously learned knowledge.
arXiv Detail & Related papers (2025-03-19T07:57:08Z)
Learning to Match Unpaired Data with Minimum Entropy Coupling [7.399561232927219]
Minimum Entropy Coupling seeks to minimize the joint Entropy, while satisfying constraints on the marginals. We propose a novel method to solve the continuous MEC problem, using well-known generative diffusion models. We empirically demonstrate that our method, DDMEC, is general and can be easily used to address challenging tasks.
arXiv Detail & Related papers (2025-03-11T14:54:14Z)
Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence [83.15764564701706]
We propose a novel framework that performs distributional vision-language alignment by integrating Cauchy-Schwarz divergence with mutual information. In the proposed framework, we find that the CS divergence and mutual information serve complementary roles in multimodal alignment, capturing both the global distribution information of each modality and the pairwise semantic relationships. Experiments on text-to-image generation and cross-modality retrieval tasks demonstrate the effectiveness of our method on vision-language alignment.
arXiv Detail & Related papers (2025-02-24T10:29:15Z)
Revisiting Self-Supervised Heterogeneous Graph Learning from Spectral Clustering Perspective [52.662463893268225]
Self-supervised heterogeneous graph learning (SHGL) has shown promising potential in diverse scenarios. Existing SHGL methods encounter two significant limitations. We introduce a novel framework enhanced by rank and dual consistency constraints.
arXiv Detail & Related papers (2024-12-01T09:33:20Z)
Enhancing Unimodal Latent Representations in Multimodal VAEs through Iterative Amortized Inference [20.761803725098005]
Multimodal variational autoencoders (VAEs) aim to capture shared latent representations by integrating information from different data modalities. A significant challenge is accurately inferring representations from any subset of modalities without training an impractical number of inference networks for all possible modality combinations. We introduce multimodal iterative amortized inference, an iterative refinement mechanism within the multimodal VAE framework.
arXiv Detail & Related papers (2024-10-15T08:49:38Z)
Mutual Information-based Representations Disentanglement for Unaligned Multimodal Language Sequences [25.73415065546444]
Key challenge in unaligned multimodal language sequences is to integrate information from various modalities to obtain a refined multimodal joint representation. We propose a Mutual Information-based Representations Disentanglement (MIRD) method for unaligned multimodal language sequences.
arXiv Detail & Related papers (2024-09-19T02:12:26Z)
Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers. We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes. We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
Continual Vision-Language Representation Learning with Off-Diagonal Information [112.39419069447902]
Multi-modal contrastive learning frameworks like CLIP typically require a large amount of image-text samples for training. This paper discusses the feasibility of continual CLIP training using streaming data.
arXiv Detail & Related papers (2023-05-11T08:04:46Z)
SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger [30.758184720183106]
We propose SoftCLIP, a novel approach that relaxes the strict one-to-one constraint and achieves a soft cross-modal alignment. In particular, on ImageNet zero-shot classification task, using CC3M/CC12M as pre-training dataset, SoftCLIP brings a top-1 accuracy improvement of 6.8%/7.2%.
arXiv Detail & Related papers (2023-03-30T17:27:22Z)
Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains. We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z)
Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text. These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining. We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
Adaptive Affinity Loss and Erroneous Pseudo-Label Refinement for Weakly Supervised Semantic Segmentation [48.294903659573585]
In this paper, we propose to embed affinity learning of multi-stage approaches in a single-stage model. A deep neural network is used to deliver comprehensive semantic information in the training phase. Experiments are conducted on the PASCAL VOC 2012 dataset to evaluate the effectiveness of our proposed approach.
arXiv Detail & Related papers (2021-08-03T07:48:33Z)
Weakly supervised segmentation with cross-modality equivariant constraints [7.757293476741071]
Weakly supervised learning has emerged as an appealing alternative to alleviate the need for large labeled datasets in semantic segmentation. We present a novel learning strategy that leverages self-supervision in a multi-modal image scenario to significantly enhance original CAMs. Our approach outperforms relevant recent literature under the same learning conditions.
arXiv Detail & Related papers (2021-04-06T13:14:20Z)
Learning Diverse Representations for Fast Adaptation to Distribution Shift [78.83747601814669]
We present a method for learning multiple models, incorporating an objective that pressures each to learn a distinct way to solve the task. We demonstrate our framework's ability to facilitate rapid adaptation to distribution shift.
arXiv Detail & Related papers (2020-06-12T12:23:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.