Related papers: Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs

Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs

URL: http://arxiv.org/abs/2511.22826v2
Date: Tue, 02 Dec 2025 23:10:59 GMT
Title: Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs
Authors: Tianle Chen, Chaitanya Chakka, Arjun Reddy Akula, Xavier Thomas, Deepti Ghadiyaram,
Abstract summary: MMA-Bench comprises videos and tasks that probe a model's reliance on specific modalities.<n>We show that current MLLMs struggle under misaligned audio-visual pairs and simple misleading text.<n>We propose a modality alignment tuning strategy to teach the model when to prioritize, leverage, or ignore specific modality cues.
Score: 5.380090638488105
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite remarkable advancements in Multimodal Large Language Models (MLLMs), a fundamental question remains: are MLLMs robust to contradicting modalities? To rigorously study this, we introduce MMA-Bench comprising videos and tasks that probe a model's reliance on specific modalities. Using black-box and white-box interpretability techniques, we provide a critical analysis of the brittleness of both open- and closed-sourced MLLMs. We show that current MLLMs struggle under misaligned audio-visual pairs and simple misleading text, thereby lacking robust multi-modal reasoning. Building on these findings, we propose a modality alignment tuning strategy to teach the model when to prioritize, leverage, or ignore specific modality cues. Through extensive experiments and analysis, we show that our alignment tuning yields demonstrably stronger multimodal grounding. This work provides both interpretability tools and a clear path toward developing MLLMs with intrinsically reliable cross-modal reasoning. Code and dataset will be publicly available.

Related papers

Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval? [8.45007357012084]
We investigate the underlying mechanisms that hinder MLLMs from serving as effective retrievers.<n>Our analysis reveals that the representation space of MLLMs is overwhelmingly dominated by textual semantics.<n>We find that the specific feature components that contribute most to the similarity computations for MLLMs are in fact distractors that actively degrade retrieval performance.
arXiv Detail & Related papers (2025-12-22T07:36:20Z)
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints [100.02131897927484]
This paper focuses on the native training of Multimodal Large Language Models (MLLMs) in an end-to-end manner.<n>We propose a native MLLM called NaViL, combined with a simple and cost-effective recipe.<n> Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs.
arXiv Detail & Related papers (2025-10-09T17:59:37Z)
Empowering Multimodal LLMs with External Tools: A Comprehensive Survey [61.66069828956139]
Multimodal Large Language Models (MLLMs) have achieved great success in various multimodal tasks, pointing toward a promising pathway to artificial general intelligence.<n>Lack of multimodal data, poor performance on many complex downstream tasks, and inadequate evaluation protocols hinder the reliability and broader applicability of MLLMs.<n>Inspired by the human ability to leverage external tools for enhanced reasoning and problem-solving, augmenting MLLMs with external tools offers a promising strategy to overcome these challenges.
arXiv Detail & Related papers (2025-08-14T07:25:45Z)
MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning [10.602434753538535]
The ability to process information from multiple modalities and to reason through it step-by-step remains a critical challenge in advancing artificial intelligence.<n>Here, we present MARBLE, a challenging multimodal reasoning benchmark that is designed to scrutinize multimodal language models.<n>We find that current MLLMs perform poorly on MARBLE -- all the 12 advanced models obtain near-random performance on M-Portal and 0% accuracy on M-Cube.
arXiv Detail & Related papers (2025-06-28T19:44:32Z)
MLLMs are Deeply Affected by Modality Bias [158.64371871084478]
Recent advances in Multimodal Large Language Models (MLLMs) have shown promising results in integrating diverse modalities such as texts and images.<n>MLLMs are heavily influenced by modality bias, often relying on language while under-utilizing other modalities like visual inputs.<n>This paper argues that MLLMs are deeply affected by modality bias, highlighting its manifestations across various tasks.
arXiv Detail & Related papers (2025-05-24T11:49:31Z)
Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model [63.14883657299359]
Multi-modal Large Language Models (MLLMs) integrate visual and linguistic reasoning to address complex tasks such as image captioning and visual question answering.<n> tuning MLLMs for downstream tasks encounters two key challenges: Task-Expert, where distribution shifts between pre-training and target datasets constrain target performance, and OpenWorld Stabilization, where catastrophic forgetting erases the model general knowledge.
arXiv Detail & Related papers (2025-03-06T15:29:13Z)
Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs [7.03771340666549]
Vision-language misalignment in Multimodal Large Language Models (MLLMs) is a critical challenge.<n>We propose MapleLeaf AKI, a novel MLLM that unlocks causal attention into modality-mutual attention (MMA) to enable image tokens to attend to text tokens.<n>Our MMA design is intended to be generic, allowing for application across various modalities, and scalable to accommodate diverse multimodal scenarios.
arXiv Detail & Related papers (2025-03-04T13:18:33Z)
Position: Empowering Time Series Reasoning with Multimodal LLMs [49.73647759532127]
We argue that multimodal language models (MLLMs) can enable more powerful and flexible reasoning for time series analysis.<n>We call on researchers and practitioners to leverage this potential by developing strategies that prioritize trust, interpretability, and robust reasoning in MLLMs.
arXiv Detail & Related papers (2025-02-03T16:10:48Z)
Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark [73.27104042215207]
We introduce EMMA, a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding.<n>EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality.<n>Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks.
arXiv Detail & Related papers (2025-01-09T18:55:52Z)
Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective [9.633811630889237]
We propose a causal framework to interpret the biases in Visual Question Answering (VQA) problems. We introduce a novel dataset with 12,000 challenging VQA instances requiring multi-hop reasoning. Our experiments show that MLLMs perform poorly on MORE, indicating strong unimodal biases and limited semantic understanding.
arXiv Detail & Related papers (2024-03-27T08:38:49Z)
Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics [32.123919380959485]
Multi-modal large language models (MLLMs) are trained based on large language models (LLM) While they excel in multi-modal tasks, the pure NLP abilities of MLLMs are often underestimated and left untested. We show that visual instruction tuning, a prevailing strategy for transitioning LLMs into MLLMs, unexpectedly and interestingly helps models attain both improved truthfulness and ethical alignment.
arXiv Detail & Related papers (2023-09-13T17:57:21Z)
A Survey on Multimodal Large Language Models [71.63375558033364]
Multimodal Large Language Model (MLLM) represented by GPT-4V has been a new rising research hotspot.<n>This paper aims to trace and summarize the recent progress of MLLMs.
arXiv Detail & Related papers (2023-06-23T15:21:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.