Set-CLIP: Exploring Aligned Semantic From Low-Alignment Multimodal Data Through A Distribution View
- URL: http://arxiv.org/abs/2406.05766v2
- Date: Sat, 21 Sep 2024 09:50:33 GMT
- Title: Set-CLIP: Exploring Aligned Semantic From Low-Alignment Multimodal Data Through A Distribution View
- Authors: Zijia Song, Zelin Zang, Yelin Wang, Guozheng Yang, Kaicheng Yu, Wanyu Chen, Miaoyu Wang, Stan Z. Li
- Abstract summary: Multimodal fusion breaks through the boundaries between diverse modalities and has already achieved notable performance.
However, in many specialized fields it is difficult to obtain sufficient aligned data for training.
We propose a new methodology based on CLIP, termed Set-CLIP.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal fusion breaks through the boundaries between diverse modalities and has already achieved notable performance. However, in many specialized fields it is difficult to obtain sufficient aligned data for training, which seriously limits the use of previously effective models. Semi-supervised learning approaches have therefore been explored to facilitate multimodal alignment by learning from low-alignment data with fewer matched pairs, but traditional techniques such as pseudo-labeling may run into trouble in label-deficient scenarios. To tackle these challenges, we reframe semi-supervised multimodal alignment as a manifold matching problem and propose a new CLIP-based methodology, termed Set-CLIP. Specifically, by designing a novel semantic density distribution loss, we constrain the latent representation distribution at a fine granularity and extract implicit semantic alignment from unpaired multimodal data, thereby reducing the reliance on large numbers of strictly matched pairs. Furthermore, we apply coarse-grained modality adaptation and unimodal self-supervised guidance to narrow the gaps between modality spaces and improve the stability of representation distributions. Extensive experiments conducted on a range of tasks in various fields, including protein analysis, remote sensing, and the general vision-language domain, validate the efficacy of the proposed Set-CLIP method. Notably, even with no paired data for supervised training, Set-CLIP still performs strongly, bringing an improvement of 144.83% over CLIP.
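The listing does not include code, so the following is only a minimal sketch of the general idea: a standard CLIP-style contrastive loss on the few available matched pairs, combined with a distribution-level matching term on unpaired embeddings. The MMD penalty used here is a stand-in illustration, not the paper's actual semantic density distribution loss, and all function names, weights, and signatures are hypothetical.

```python
# Hypothetical sketch (not the authors' code): semi-supervised multimodal
# alignment = supervised contrastive loss on paired samples + a
# distribution-matching term (Gaussian-kernel MMD) on unpaired samples.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over matched image-text pairs (CLIP-style)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def mmd_loss(x, y, sigma=1.0):
    """Gaussian-kernel MMD between two embedding sets; a stand-in for a
    distribution-level alignment term on unpaired modalities."""
    def kernel(a, b):
        d = torch.cdist(a, b) ** 2
        return torch.exp(-d / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()


def semi_supervised_alignment_loss(paired_img, paired_txt,
                                   unpaired_img, unpaired_txt, lam=0.5):
    """Combine pairwise supervision with distribution matching on unpaired data."""
    sup = clip_contrastive_loss(paired_img, paired_txt)
    unsup = mmd_loss(F.normalize(unpaired_img, dim=-1),
                     F.normalize(unpaired_txt, dim=-1))
    return sup + lam * unsup
```

In such a setup the distribution term supplies a training signal even when no matched pairs are available, which is the regime the abstract highlights; the weighting `lam` is an assumed hyperparameter.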
Related papers
- Enhancing Unimodal Latent Representations in Multimodal VAEs through Iterative Amortized Inference [20.761803725098005]
Multimodal variational autoencoders (VAEs) aim to capture shared latent representations by integrating information from different data modalities.
A significant challenge is accurately inferring representations from any subset of modalities without training an impractical number of inference networks for all possible modality combinations.
We introduce multimodal iterative amortized inference, an iterative refinement mechanism within the multimodal VAE framework.
arXiv Detail & Related papers (2024-10-15T08:49:38Z) - Mutual Information-based Representations Disentanglement for Unaligned Multimodal Language Sequences [25.73415065546444]
A key challenge in unaligned multimodal language sequences is to integrate information from various modalities to obtain a refined multimodal joint representation.
We propose a Mutual Information-based Representations Disentanglement (MIRD) method for unaligned multimodal language sequences.
arXiv Detail & Related papers (2024-09-19T02:12:26Z) - Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - Continual Vision-Language Representation Learning with Off-Diagonal
Information [112.39419069447902]
Multi-modal contrastive learning frameworks like CLIP typically require a large amount of image-text samples for training.
This paper discusses the feasibility of continual CLIP training using streaming data.
arXiv Detail & Related papers (2023-05-11T08:04:46Z) - SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger [30.758184720183106]
We propose SoftCLIP, a novel approach that relaxes the strict one-to-one constraint and achieves a soft cross-modal alignment.
In particular, on the ImageNet zero-shot classification task, using CC3M/CC12M as the pre-training dataset, SoftCLIP brings a top-1 accuracy improvement of 6.8%/7.2%.
arXiv Detail & Related papers (2023-03-30T17:27:22Z) - Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - Adaptive Affinity Loss and Erroneous Pseudo-Label Refinement for Weakly
Supervised Semantic Segmentation [48.294903659573585]
In this paper, we propose to embed affinity learning of multi-stage approaches in a single-stage model.
A deep neural network is used to deliver comprehensive semantic information in the training phase.
Experiments are conducted on the PASCAL VOC 2012 dataset to evaluate the effectiveness of our proposed approach.
arXiv Detail & Related papers (2021-08-03T07:48:33Z) - Weakly supervised segmentation with cross-modality equivariant
constraints [7.757293476741071]
Weakly supervised learning has emerged as an appealing alternative to alleviate the need for large labeled datasets in semantic segmentation.
We present a novel learning strategy that leverages self-supervision in a multi-modal image scenario to significantly enhance original CAMs.
Our approach outperforms relevant recent literature under the same learning conditions.
arXiv Detail & Related papers (2021-04-06T13:14:20Z) - Learning Diverse Representations for Fast Adaptation to Distribution
Shift [78.83747601814669]
We present a method for learning multiple models, incorporating an objective that pressures each to learn a distinct way to solve the task.
We demonstrate our framework's ability to facilitate rapid adaptation to distribution shift.
arXiv Detail & Related papers (2020-06-12T12:23:50Z)