The Devil Is in the Details: Tackling Unimodal Spurious Correlations for Generalizable Multimodal Reward Models
- URL: http://arxiv.org/abs/2503.03122v2
- Date: Mon, 10 Mar 2025 02:34:53 GMT
- Title: The Devil Is in the Details: Tackling Unimodal Spurious Correlations for Generalizable Multimodal Reward Models
- Authors: Zichao Li, Xueru Wen, Jie Lou, Yuqiu Ji, Yaojie Lu, Xianpei Han, Debing Zhang, Le Sun
- Abstract summary: Multimodal Reward Models (MM-RMs) are crucial for aligning Large Language Models (LLMs) with human preferences. MM-RMs often struggle to generalize to out-of-distribution data due to their reliance on unimodal spurious correlations. We introduce a Shortcut-aware MM-RM learning algorithm that mitigates this issue by dynamically reweighting training samples.
- Score: 31.81567038783558
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Reward Models (MM-RMs) are crucial for aligning Large Language Models (LLMs) with human preferences, particularly as LLMs increasingly interact with multimodal data. However, we find that MM-RMs trained on existing datasets often struggle to generalize to out-of-distribution data due to their reliance on unimodal spurious correlations, primarily text-only shortcuts within the training distribution, which prevents them from leveraging true multimodal reward functions. To address this, we introduce a Shortcut-aware MM-RM learning algorithm that mitigates this issue by dynamically reweighting training samples, shifting the distribution toward better multimodal understanding, and reducing dependence on unimodal spurious correlations. Our experiments demonstrate significant improvements in generalization, downstream task performance, and scalability, establishing a more robust framework for multimodal reward modeling.
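The abstract describes the algorithm only at a high level. A minimal PyTorch sketch of the dynamic-reweighting idea follows: a frozen text-only reward model exposes which preference pairs are solvable by the unimodal shortcut, and those pairs are down-weighted in the pairwise loss so the multimodal model must rely on the image. All names (`rm_multimodal`, `rm_text_only`, the batch keys) are illustrative assumptions, not the authors' API.

```python
import torch
import torch.nn.functional as F

def shortcut_aware_loss(rm_multimodal, rm_text_only, batch):
    """Bradley-Terry preference loss with per-sample reweighting.

    Pairs that a frozen text-only reward model already ranks correctly
    are treated as solvable via unimodal shortcuts and down-weighted.
    """
    # Multimodal reward margin (chosen minus rejected response).
    margin_mm = (rm_multimodal(batch["image"], batch["chosen"])
                 - rm_multimodal(batch["image"], batch["rejected"]))

    with torch.no_grad():
        # Text-only margin exposes the unimodal shortcut.
        margin_txt = (rm_text_only(batch["chosen"])
                      - rm_text_only(batch["rejected"]))
        # High weight where text alone fails, low where it succeeds.
        weights = torch.sigmoid(-margin_txt)
        weights = weights / weights.mean()  # keep the loss scale stable

    # Standard pairwise logistic loss, reweighted per sample.
    return (weights * -F.logsigmoid(margin_mm)).mean()
```

Normalizing the weights by their mean keeps the effective learning rate comparable to the unweighted loss, so the reweighting shifts the training distribution without changing its overall scale.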
Related papers
- PAL: Prompting Analytic Learning with Missing Modality for Multi-Modal Class-Incremental Learning [42.00851701431368]
Multi-modal class-incremental learning (MMCIL) seeks to leverage multi-modal data, such as audio-visual and image-text pairs.
A critical challenge remains: missing modalities during the incremental learning phases.
We propose PAL, a novel exemplar-free framework tailored to MMCIL under missing-modality scenarios.
arXiv Detail & Related papers (2025-01-16T08:04:04Z)
- Asymmetric Reinforcing against Multi-modal Representation Bias [59.685072206359855]
We propose an Asymmetric Reinforcing method against Multimodal representation bias (ARM). Our ARM dynamically reinforces the weak modalities while maintaining the ability to represent dominant modalities through conditional mutual information. We have significantly improved the performance of multimodal learning, making notable progress in mitigating imbalanced multimodal learning.
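The summary names the mechanism but not its form; as a loose proxy (the paper's conditional-mutual-information estimate is not reproduced here), one can reweight per-modality losses so the currently weaker modality receives more gradient:

```python
import torch

def asymmetric_modality_loss(loss_audio, loss_visual, tau=1.0):
    """Proxy for ARM's idea: up-weight whichever modality currently
    performs worse, using the (detached) losses as the signal."""
    losses = torch.stack([loss_audio, loss_visual]).detach()
    weights = torch.softmax(losses / tau, dim=0)  # weaker modality -> larger weight
    return weights[0] * loss_audio + weights[1] * loss_visual
```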
arXiv Detail & Related papers (2025-01-02T13:00:06Z)
- Multimodal Fusion Balancing Through Game-Theoretic Regularization [22.959030061257533]
We show that current balancing methods struggle to train multimodal models that surpass even simple baselines, such as ensembles. This raises the question: how can we ensure that all modalities in multimodal training are sufficiently trained, and that learning from new modalities consistently improves performance? This paper proposes the Multimodal Competition Regularizer (MCR), a new loss component inspired by mutual information (MI) decomposition.
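The exact MCR loss is not given in the summary; a crude stand-in for "each modality must contribute" replaces one modality with zeros and rewards the resulting drop in performance (the paper's game-theoretic, MI-based formulation is more principled than this ablation):

```python
import torch
import torch.nn.functional as F

def competition_regularized_loss(model, x_a, x_b, y, lam=0.1):
    """Encourage both modalities to matter: ablating either one
    (zero-masking here) should increase the task loss."""
    loss_full = F.cross_entropy(model(x_a, x_b), y)
    loss_no_a = F.cross_entropy(model(torch.zeros_like(x_a), x_b), y)
    loss_no_b = F.cross_entropy(model(x_a, torch.zeros_like(x_b)), y)
    marginal = (loss_no_a - loss_full) + (loss_no_b - loss_full)
    return loss_full - lam * marginal  # maximize each modality's marginal contribution
```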
arXiv Detail & Related papers (2024-11-11T19:53:05Z)
- Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
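The pattern in the summary (modality-specific encoders, connectors into a shared space, an MoE on top) can be illustrated with a toy module; the dimensions, dense soft routing, and two-modality setup are simplifying assumptions, not Uni-MoE's actual architecture:

```python
import torch
import torch.nn as nn

class TinyUniMoE(nn.Module):
    """Modality-specific encoders + linear 'connectors' into a unified
    representation, followed by a (soft-routed) mixture-of-experts FFN."""

    def __init__(self, d_model=256, n_experts=4):
        super().__init__()
        self.encoders = nn.ModuleDict({
            "image": nn.Linear(2048, 512),  # stand-ins for real encoders
            "audio": nn.Linear(1024, 512),
        })
        self.connectors = nn.ModuleDict(
            {m: nn.Linear(512, d_model) for m in self.encoders})
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, modality, x):
        h = self.connectors[modality](self.encoders[modality](x))
        gates = torch.softmax(self.router(h), dim=-1)                   # (B, E)
        expert_out = torch.stack([e(h) for e in self.experts], dim=-1)  # (B, D, E)
        return torch.einsum("bde,be->bd", expert_out, gates)
```

A production MoE would route sparsely (top-k gating) for efficiency; soft routing is used here only to keep the sketch short.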
arXiv Detail & Related papers (2024-05-18T12:16:01Z)
- Simplifying Multimodality: Unimodal Approach to Multimodal Challenges in Radiology with General-Domain Large Language Model [3.012719451477384]
We introduce MID-M, a novel framework that leverages the in-context learning capabilities of a general-domain Large Language Model (LLM) to process multimodal data via image descriptions.
MID-M achieves a comparable or superior performance to task-specific fine-tuned LMMs and other general-domain ones, without the extensive domain-specific training or pre-training on multimodal data.
The robustness of MID-M against data quality issues demonstrates its practical utility in real-world medical domain applications.
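Since MID-M reduces the multimodal task to text, its core step is prompt construction: convert the image to a description, then query a general-domain LLM with a few in-context exemplars. The prompt format below is a guess for illustration, not the paper's template:

```python
def build_mid_m_prompt(image_description, exemplars, question):
    """Assemble a few-shot prompt: (description, question, answer)
    exemplars followed by the query case."""
    shots = "\n\n".join(
        f"Image description: {d}\nQuestion: {q}\nAnswer: {a}"
        for d, q, a in exemplars)
    return (f"{shots}\n\n"
            f"Image description: {image_description}\n"
            f"Question: {question}\nAnswer:")
```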
arXiv Detail & Related papers (2024-04-29T13:23:33Z)
- Cross-Modal Prototype based Multimodal Federated Learning under Severely Missing Modality [31.727012729846333]
Multimodal Federated Cross Prototype Learning (MFCPL) is a novel approach for multimodal federated learning (MFL) under severely missing modalities.
MFCPL provides diverse modality knowledge at the modality-shared level through cross-modal regularization and at the modality-specific level through a cross-modal contrastive mechanism.
Our approach introduces the cross-modal alignment to provide regularization for modality-specific features, thereby enhancing overall performance.
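A rough sketch of one ingredient, the prototype-based contrastive term: client features are pulled toward the shared cross-modal prototype of their class and pushed from the others (MFCPL's full modality-shared/modality-specific design is not reproduced here):

```python
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(feat, prototypes, labels, temp=0.1):
    """feat: (B, D) client features; prototypes: (C, D) shared
    cross-modal class prototypes; labels: (B,) class indices."""
    feat = F.normalize(feat, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    logits = feat @ protos.t() / temp  # cosine similarity to every prototype
    return F.cross_entropy(logits, labels)
```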
arXiv Detail & Related papers (2024-01-25T02:25:23Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
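The early-fusion idea translates directly into code: project each modality stream into a shared token space, concatenate along the sequence axis, and run one encoder instead of one per modality. The dimensions and depth below are arbitrary placeholders:

```python
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    """Single-stream early fusion: all modality tokens share one encoder."""

    def __init__(self, dims=(150, 150, 150), d_model=256):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in dims)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, streams):                  # list of (B, T, dim) tensors
        tokens = [p(s) for p, s in zip(self.proj, streams)]
        fused = torch.cat(tokens, dim=1)         # early fusion: one sequence
        return self.encoder(fused).mean(dim=1)   # one representation for all modalities
```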
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
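The summary gives only the name; assuming MMLoRA follows the standard LoRA recipe (a frozen pre-trained weight plus a trainable low-rank update), a generic adapter looks like the following. How MMLoRA places these adapters across the uni-modal backbones is not specified here:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pre-trained backbone frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no-op at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```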
arXiv Detail & Related papers (2023-10-08T15:01:54Z)
- Unimodal Training-Multimodal Prediction: Cross-modal Federated Learning with Hierarchical Aggregation [16.308470947384134]
HA-Fedformer is a novel transformer-based model that enables each client to train with only a unimodal local dataset.
We develop an uncertainty-aware aggregation method for the local encoders with layer-wise Markov Chain Monte Carlo sampling.
Our experiments on popular sentiment analysis benchmarks, CMU-MOSI and CMU-MOSEI, demonstrate that HA-Fedformer significantly outperforms state-of-the-art multimodal models.
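One plausible reading of the uncertainty-aware aggregation, sketched loosely: each client uploads several posterior samples of a layer's weights (the paper draws them with layer-wise MCMC), and clients whose samples vary less are weighted more at the server. This is an interpretation, not the paper's exact rule:

```python
import torch

def uncertainty_aware_aggregate(client_samples):
    """client_samples: list of (S, ...) tensors, S posterior samples of
    one layer's weights per client. Returns the aggregated layer."""
    means = torch.stack([s.mean(dim=0) for s in client_samples])            # (K, ...)
    variances = torch.stack([s.var(dim=0).mean() for s in client_samples])  # (K,)
    weights = 1.0 / (variances + 1e-8)        # low-variance clients trusted more
    weights = weights / weights.sum()
    shape = (-1,) + (1,) * (means.dim() - 1)  # broadcast weights over weight dims
    return (weights.view(shape) * means).sum(dim=0)
```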
arXiv Detail & Related papers (2023-03-27T07:07:33Z)
- Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering [63.87200781247364]
Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations.
We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
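CIB's exact bound is not reproduced in the summary; the generic variational-IB objective below shows the compression/prediction tradeoff it refines, with a KL surrogate standing in for the input-representation mutual information:

```python
import torch
import torch.nn.functional as F

def vib_style_loss(logits, labels, mu, logvar, beta=1e-3):
    """Fit the task while penalizing KL(N(mu, exp(logvar)) || N(0, I)),
    an upper-bound surrogate for I(X; Z). NOT the paper's CIB bound."""
    task = F.cross_entropy(logits, labels)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1).mean()
    return task + beta * kl
```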
arXiv Detail & Related papers (2022-09-14T22:04:10Z)
- Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to make use of the plentiful unlabeled, unpaired multimodal data.
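The "related vs. unrelated" training signal resembles an in-batch contrastive objective; below is an InfoNCE-style stand-in (the paper applies this inside a generative-model framework, which is omitted here):

```python
import torch
import torch.nn.functional as F

def relatedness_contrastive_loss(z_img, z_txt, temp=0.07):
    """Diagonal pairs in the batch are 'related'; every other
    combination serves as an 'unrelated' negative."""
    z_img = F.normalize(z_img, dim=-1)
    z_txt = F.normalize(z_txt, dim=-1)
    logits = z_img @ z_txt.t() / temp  # (B, B) pairwise similarities
    targets = torch.arange(z_img.size(0), device=z_img.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```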
arXiv Detail & Related papers (2020-07-02T15:08:11Z)