Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models
- URL: http://arxiv.org/abs/2510.08492v1
- Date: Thu, 09 Oct 2025 17:32:23 GMT
- Title: Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models
- Authors: Sharut Gupta, Shobhita Sundaram, Chenyu Wang, Stefanie Jegelka, Phillip Isola
- Abstract summary: We introduce UML: Unpaired Multimodal Learner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. We show that using unpaired data from auxiliary modalities consistently improves downstream performance across diverse unimodal targets such as image and audio.
- Score: 63.032359320629105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional multimodal learners find unified representations for tasks like visual question answering, but rely heavily on paired datasets. However, an overlooked yet potentially powerful question is: can one leverage auxiliary unpaired multimodal data to directly enhance representation learning in a target modality? We introduce UML: Unpaired Multimodal Learner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. This design exploits the assumption that different modalities are projections of a shared underlying reality, allowing the model to benefit from cross-modal structure without requiring explicit pairs. Theoretically, under linear data-generating assumptions, we show that unpaired auxiliary data can yield representations strictly more informative about the data-generating process than unimodal training. Empirically, we show that using unpaired data from auxiliary modalities -- such as text, audio, or images -- consistently improves downstream performance across diverse unimodal targets such as image and audio. Our project page: https://unpaired-multimodal.github.io/
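Below is a minimal sketch of the alternating, parameter-sharing training loop the abstract describes, assuming a shared trunk with per-modality input projections. Module names, feature dimensions, the classification objective, and the toy loaders are illustrative assumptions, not the authors' released code (see the project page for that).

```python
# Hypothetical sketch of UML-style training: one shared backbone, per-modality
# input projections, and alternating unpaired batches from two modalities.
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    def __init__(self, input_dims, hidden=512, num_classes=10):
        super().__init__()
        # One lightweight projection per modality (dimensions are assumptions).
        self.proj = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in input_dims.items()})
        # Parameters below are shared across all modalities.
        self.trunk = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x, modality):
        return self.head(self.trunk(self.proj[modality](x)))

def step(model, opt, criterion, batch, modality):
    x, y = batch
    loss = criterion(model(x, modality), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy, unpaired data: image and text batches are drawn independently,
# so no image-text pairing is ever assumed.
image_loader = [(torch.randn(32, 2048), torch.randint(0, 10, (32,))) for _ in range(4)]
text_loader = [(torch.randn(32, 768), torch.randint(0, 10, (32,))) for _ in range(4)]

model = SharedBackbone({"image": 2048, "text": 768})
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()

# Alternate modalities so the shared trunk sees both, without any pairing.
for img_batch, txt_batch in zip(image_loader, text_loader):
    step(model, opt, criterion, img_batch, "image")
    step(model, opt, criterion, txt_batch, "text")
```

Whether the auxiliary modality shares the target label space or is trained with a different objective is left open by the abstract; a single shared classifier is used here only to keep the sketch self-contained.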
Related papers
- Learning Shared Representations from Unpaired Data [8.370305493567542]
We show that shared representations can be learned almost exclusively from unpaired data.
Empirical results in computer vision and natural language processing domains support its potential.
arXiv Detail & Related papers (2025-05-23T11:13:04Z)
- MINIMA: Modality Invariant Image Matching [52.505282811925454]
We present MINIMA, a unified image matching framework for multiple cross-modal cases.
We scale up the modalities from cheap but rich RGB-only matching data, by means of generative models.
With MD-syn, we can directly train any advanced matching pipeline on randomly selected modality pairs to obtain cross-modal ability.
arXiv Detail & Related papers (2024-12-27T02:39:50Z) - MultiDelete for Multimodal Machine Unlearning [14.755831733659699]
MultiDelete is designed to decouple associations between unimodal data points during unlearning.
It can maintain the multimodal and unimodal knowledge of the original model post unlearning.
It can provide better protection to unlearned data against adversarial attacks.
arXiv Detail & Related papers (2023-11-18T08:30:38Z)
- SUMMIT: Source-Free Adaptation of Uni-Modal Models to Multi-Modal Targets [30.262094419776208]
Current approaches assume that the source data is available during adaptation and that the source consists of paired multi-modal data.
We propose a switching framework which automatically chooses between two complementary methods of cross-modal pseudo-label fusion.
Our method achieves an improvement in mIoU of up to 12% over competing baselines.
arXiv Detail & Related papers (2023-08-23T02:57:58Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z)
- Does a Technique for Building Multimodal Representation Matter? -- Comparative Analysis [0.0]
We show that the choice of technique for building a multimodal representation is crucial for obtaining the best possible model performance.
Experiments are conducted on three datasets: Amazon Reviews, MovieLens25M, and MovieLens1M.
arXiv Detail & Related papers (2022-06-09T21:30:10Z)
- Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to use the plentiful unlabeled, unpaired multimodal data (a minimal contrastive sketch follows this list).
arXiv Detail & Related papers (2020-07-02T15:08:11Z)
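For the Relating-by-Contrasting entry above, here is a minimal sketch of a contrastive objective that separates "related" from "unrelated" multimodal samples. The encoders, dimensions, and InfoNCE-style loss are assumptions chosen for illustration; they are not that paper's generative model.

```python
# Hypothetical related-vs-unrelated contrastive objective: diagonal entries of
# the similarity matrix are treated as related pairs, while off-diagonal
# mismatches within the batch serve as unrelated negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoModalityEncoder(nn.Module):
    def __init__(self, dim_a=2048, dim_b=768, embed=256):
        super().__init__()
        self.enc_a = nn.Linear(dim_a, embed)  # e.g. image features (assumed dims)
        self.enc_b = nn.Linear(dim_b, embed)  # e.g. text features (assumed dims)

    def forward(self, a, b):
        za = F.normalize(self.enc_a(a), dim=-1)
        zb = F.normalize(self.enc_b(b), dim=-1)
        return za, zb

def related_vs_unrelated_loss(za, zb, temperature=0.07):
    logits = za @ zb.t()  / temperature   # pairwise cross-modal similarities
    targets = torch.arange(za.size(0))    # i-th sample of A is related to i-th of B
    return F.cross_entropy(logits, targets)

model = TwoModalityEncoder()
a = torch.randn(16, 2048)  # toy modality-A batch
b = torch.randn(16, 768)   # toy modality-B batch
za, zb = model(a, b)
loss = related_vs_unrelated_loss(za, zb)
loss.backward()
```

Treating in-batch mismatches as unrelated negatives is one common instantiation of the related-vs-unrelated distinction; the original framework applies the idea within a generative model rather than a discriminative encoder.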