MaD-Mix: Multi-Modal Data Mixtures via Latent Space Coupling for Vision-Language Model Training
- URL: http://arxiv.org/abs/2602.07790v1
- Date: Sun, 08 Feb 2026 03:07:36 GMT
- Title: MaD-Mix: Multi-Modal Data Mixtures via Latent Space Coupling for Vision-Language Model Training
- Authors: Wanyun Xie, Francesco Tonin, Volkan Cevher
- Abstract summary: MaD-Mix is a principled framework that derives multi-modal data mixtures for VLM training. MaD-Mix speeds up VLM training across diverse benchmarks. In complex tri-modal video-image-text scenarios, MaD-Mix boosts average accuracy over uniform weights, with negligible mixture overhead.
- Score: 54.78779514101305
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) are typically trained on a diverse set of multi-modal domains, yet current practices rely on costly manual tuning. We propose MaD-Mix, a principled and computationally efficient framework that derives multi-modal data mixtures for VLM training. MaD-Mix formulates data mixing as modality-aware domain alignment maximization and obtains closed-form multi-modal alignment scores from the Fenchel dual through inter-modal coupling variables. MaD-Mix systematically handles domains with missing modalities, allowing for the integration of language-only domains. Empirical evaluations across 0.5B and 7B models demonstrate that MaD-Mix accelerates VLM training across diverse benchmarks. MaD-Mix matches human-tuned data mixtures using 22% fewer training steps in image-text instruction tuning. In complex tri-modal video-image-text scenarios, where manual tuning becomes impractical, MaD-Mix boosts average accuracy over uniform weights, with negligible mixture computation overhead (< 1 GPU-hour), enabling scalable mixture design for modern VLM pipelines.
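The abstract gives the pipeline shape (score each domain's cross-modal alignment, then turn scores into mixture weights) without the closed-form expressions. The snippet below is a minimal sketch under that reading, not the paper's method: mean cosine alignment stands in for the Fenchel-dual coupling scores, a softmax stands in for the actual weight derivation, and `alignment_score`, `mixture_weights`, and `language_only_score` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)

def alignment_score(img_emb, txt_emb):
    """Mean cosine similarity between paired image/text embeddings.

    A stand-in for MaD-Mix's closed-form coupling-based scores; the
    paper derives those from a Fenchel dual, not reproduced here.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return float((img * txt).sum(axis=1).mean())

def mixture_weights(domains, temperature=1.0, language_only_score=0.0):
    """Convert per-domain scores into sampling proportions via a softmax.

    Language-only domains (missing the image modality) get a fixed
    fallback score here; the paper handles missing modalities in a
    more principled way.
    """
    scores = []
    for d in domains:
        if d["img"] is None:  # language-only domain
            scores.append(language_only_score)
        else:
            scores.append(alignment_score(d["img"], d["txt"]))
    s = np.asarray(scores) / temperature
    w = np.exp(s - s.max())
    return w / w.sum()

# Three toy domains: two image-text, one language-only.
domains = [
    {"img": rng.normal(size=(128, 64)), "txt": rng.normal(size=(128, 64))},
    {"img": rng.normal(size=(128, 64)), "txt": rng.normal(size=(128, 64))},
    {"img": None, "txt": rng.normal(size=(128, 64))},
]
print(mixture_weights(domains))  # near-uniform on random embeddings
```

On real data, higher-alignment domains would receive proportionally larger sampling weights, and the temperature controls how sharply the mixture concentrates on them.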
Related papers
- Linear Model Merging Unlocks Simple and Scalable Multimodal Data Mixture Optimization [38.78268216433473]
We study model merging as an efficient strategy for estimating the performance of different data mixtures (a toy sketch follows this entry).
We conduct experiments on 14 multimodal benchmarks and empirically demonstrate that the proxy models exhibit a high rank correlation with models trained on actual data mixtures.
arXiv Detail & Related papers (2026-02-04T16:06:39Z)
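A hedged sketch of the proxy idea in the entry above: average the parameters of per-domain models with candidate mixture weights and rank mixtures by the merged model's loss, instead of retraining on every mixture. Linear least-squares models stand in for fine-tuned checkpoints; `merge` and `proxy_loss` are illustrative names, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(1)

def merge(weights, models):
    """Linearly combine per-domain model parameters (flat vectors here)."""
    return sum(w * m for w, m in zip(weights, models))

def proxy_loss(params, X, y):
    """Squared error of a linear model; stands in for a benchmark eval."""
    return float(((X @ params - y) ** 2).mean())

# Held-out evaluation data and two "checkpoints" fit on noisy views of it,
# standing in for models fine-tuned on two different data domains.
X, y = rng.normal(size=(256, 16)), rng.normal(size=256)
model_a = np.linalg.lstsq(X + 0.1 * rng.normal(size=X.shape), y, rcond=None)[0]
model_b = np.linalg.lstsq(X + 0.5 * rng.normal(size=X.shape), y, rcond=None)[0]

# Rank candidate mixtures by the merged proxy's loss instead of
# retraining a model on each actual data mixture.
for w in (0.0, 0.25, 0.5, 0.75, 1.0):
    merged = merge([w, 1 - w], [model_a, model_b])
    print(f"weight on domain A = {w:.2f}: proxy loss = {proxy_loss(merged, X, y):.3f}")
```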
- MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning [37.71233459623324]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for post-training large language models (LLMs).
Applying RLVR to Multimodal LLMs (MLLMs) presents significant opportunities but is complicated by the broader, heterogeneous nature of vision-language tasks.
We introduce a systematic post-training framework for Multimodal LLM RLVR, featuring a rigorous data mixture problem formulation and benchmark implementation.
arXiv Detail & Related papers (2025-05-30T17:59:38Z)
- MMBind: Unleashing the Potential of Distributed and Heterogeneous Data for Multimodal Learning in IoT [11.884646027921173]
We propose MMBind, a new data binding approach for multimodal learning on distributed and heterogeneous IoT data.
We show that MMBind outperforms state-of-the-art baselines under varying degrees of data incompleteness and domain shift.
arXiv Detail & Related papers (2024-11-18T23:34:07Z)
- No Need to Talk: Asynchronous Mixture of Language Models [25.3581396758015]
SmallTalk LM is an innovative method for training a mixture of language models in an almost asynchronous manner.
At inference, a lightweight router directs a given sequence to a single expert according to a short prefix (a toy router sketch follows this entry).
Experiments on language modeling demonstrate that SmallTalk LM achieves significantly lower perplexity than dense model baselines.
arXiv Detail & Related papers (2024-10-04T15:50:10Z)
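A toy illustration of the routing described above, assuming a nearest-prototype router: embed the first few tokens of a sequence and dispatch it to the single expert whose prototype matches best. The embedding table, prototypes, and `prefix_len` are all stand-ins; the paper's router is a small learned model.

```python
import numpy as np

rng = np.random.default_rng(2)

def route(token_ids, emb_table, expert_protos, prefix_len=8):
    """Send a sequence to the single expert whose prototype best matches
    a mean-pooled embedding of the sequence's short prefix."""
    z = emb_table[token_ids[:prefix_len]].mean(axis=0)
    return int(np.argmax(expert_protos @ z))

vocab, dim, n_experts = 1000, 32, 4
emb_table = rng.normal(size=(vocab, dim))          # toy token embeddings
expert_protos = rng.normal(size=(n_experts, dim))  # one prototype per expert

sequence = rng.integers(0, vocab, size=128)
print("route to expert:", route(sequence, emb_table, expert_protos))
```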
- MM-Mixing: Multi-Modal Mixing Alignment for 3D Understanding [64.65145700121442]
We introduce MM-Mixing, a multi-modal mixing alignment framework for 3D understanding.
Our proposed two-stage training pipeline combines feature-level and input-level mixing to optimize the 3D encoder.
We demonstrate that MM-Mixing significantly improves baseline performance across various learning scenarios.
arXiv Detail & Related papers (2024-05-28T18:44:15Z)
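A minimal sketch of the two mixing levels named in the MM-Mixing entry above, assuming plain convex (mixup-style) combinations: mix two inputs before the 3D encoder and mix the corresponding cross-modal features with the same coefficient so the mixed sample keeps a consistent alignment target. The shapes and the Beta-sampled coefficient are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def mix(a, b, lam):
    """Convex combination used at both the input and the feature level."""
    return lam * a + (1 - lam) * b

lam = float(rng.beta(1.0, 1.0))

# Input-level: mix two point clouds (N x 3) before the 3D encoder.
pc_a, pc_b = rng.normal(size=(1024, 3)), rng.normal(size=(1024, 3))
pc_mixed = mix(pc_a, pc_b, lam)

# Feature-level: mix the matching text features with the same coefficient,
# so the mixed 3D sample keeps a consistent cross-modal alignment target.
txt_a, txt_b = rng.normal(size=256), rng.normal(size=256)
target_feat = mix(txt_a, txt_b, lam)

print(pc_mixed.shape, target_feat.shape, round(lam, 3))
```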
- Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance as a function of the mixture proportions.
We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law (a toy fit follows this entry).
Our method effectively optimizes the training mixture of a 1B model trained for 100B tokens in RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z)
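To make the predict-then-optimize loop in the Data Mixing Laws entry concrete: fit a cheap parametric curve to validation losses from a few pilot mixtures, then pick the proportion that minimizes the predicted loss. A quadratic stands in for the paper's functional forms, and the pilot numbers are synthetic.

```python
import numpy as np

# Pilot runs at a few proportions r of domain A (losses are synthetic).
r_pilot = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
loss_pilot = np.array([3.10, 2.71, 2.55, 2.52, 2.60])

# Fit the "mixing law" and minimize the predicted loss over a grid.
coeffs = np.polyfit(r_pilot, loss_pilot, deg=2)
grid = np.linspace(0.0, 1.0, 101)
best = grid[np.argmin(np.polyval(coeffs, grid))]
print(f"predicted best proportion of domain A: {best:.2f}")
```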
- PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis [71.8946280170493]
This paper introduces PowMix, a versatile embedding space regularizer that builds upon the strengths of unimodal mixing-based regularization approaches.
PowMix is integrated before the fusion stage of multimodal architectures and facilitates intra-modal mixing, such as mixing text with text, to act as a regularizer.
arXiv Detail & Related papers (2023-12-19T17:01:58Z)
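A generic embedding-space sketch of the intra-modal mixing the PowMix entry above describes, assuming standard mixup-style interpolation against a shuffled batch; PowMix's actual mixing schedule and strength are defined in the paper, and `intra_modal_mix` is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(5)

def intra_modal_mix(emb, labels, alpha=0.4):
    """Mix each embedding with a shuffled partner from the SAME modality
    (e.g., text with text) before the fusion stage; labels are mixed
    with the same coefficient so the target stays consistent."""
    lam = float(rng.beta(alpha, alpha))
    perm = rng.permutation(len(emb))
    mixed_emb = lam * emb + (1 - lam) * emb[perm]
    mixed_labels = lam * labels + (1 - lam) * labels[perm]
    return mixed_emb, mixed_labels

text_emb = rng.normal(size=(32, 128))            # pre-fusion text embeddings
labels = rng.integers(0, 2, size=(32, 1)).astype(float)
mixed_emb, mixed_labels = intra_modal_mix(text_emb, labels)
print(mixed_emb.shape, mixed_labels.shape)
```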
- TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training [42.142924806184425]
Mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss.
TiMix exhibits a comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods.
arXiv Detail & Related papers (2023-12-14T12:02:24Z)
- Learning with MISELBO: The Mixture Cookbook [62.75516608080322]
We present the first ever mixture of variational approximations for a normalizing flow-based hierarchical variational autoencoder (VAE) with VampPrior and a PixelCNN decoder network.
We explain this cooperative behavior by drawing a novel connection between VI and adaptive importance sampling.
We obtain state-of-the-art results among VAE architectures in terms of negative log-likelihood on the MNIST and FashionMNIST datasets.
arXiv Detail & Related papers (2022-09-30T15:01:35Z)
- Harnessing Hard Mixed Samples with Decoupled Regularizer [69.98746081734441]
Mixup is an efficient data augmentation approach that improves the generalization of neural networks by smoothing the decision boundary with mixed data.
In this paper, we propose an efficient mixup objective function with a decoupled regularizer named Decoupled Mixup (DM).
DM can adaptively utilize hard mixed samples to mine discriminative features without losing the original smoothness of mixup.
arXiv Detail & Related papers (2022-03-21T07:12:18Z)
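For context on what DM decouples: vanilla mixup ties both label terms to the same coefficient lam, which can wash out gradients on hard mixed samples. The sketch below shows only that vanilla coupled objective; DM's decoupled regularizer itself is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(6)

def soft_cross_entropy(logits, target_probs):
    """Soft-label cross-entropy with a numerically stable log-softmax."""
    m = logits.max(axis=1, keepdims=True)
    log_p = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return float(-(target_probs * log_p).sum(axis=1).mean())

def vanilla_mixup_loss(logits_mixed, y_a, y_b, lam):
    """Both label terms share the coefficient lam; DM replaces this
    coupling with a decoupled regularizer (not reproduced here)."""
    return lam * soft_cross_entropy(logits_mixed, y_a) + \
        (1 - lam) * soft_cross_entropy(logits_mixed, y_b)

n, k = 16, 10
y_a = np.eye(k)[rng.integers(0, k, size=n)]  # one-hot labels of sample a
y_b = np.eye(k)[rng.integers(0, k, size=n)]  # one-hot labels of sample b
logits = rng.normal(size=(n, k))             # model output on mixed inputs
print(round(vanilla_mixup_loss(logits, y_a, y_b, lam=0.7), 3))
```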