Rethinking the Text-Vision Reasoning Imbalance in MLLMs through the Lens of Training Recipes
- URL: http://arxiv.org/abs/2510.22836v1
- Date: Sun, 26 Oct 2025 21:06:13 GMT
- Title: Rethinking the Text-Vision Reasoning Imbalance in MLLMs through the Lens of Training Recipes
- Authors: Guanyu Yao, Qiucheng Wu, Yang Zhang, Zhaowen Wang, Handong Zhao, Shiyu Chang
- Abstract summary: Multimodal large language models (MLLMs) have demonstrated strong capabilities on vision-and-language tasks. Recent findings reveal an imbalance in their reasoning capabilities across visual and textual modalities. We refer to this phenomenon as the modality gap, defined as the performance disparity between text-centric and vision-centric inputs.
- Score: 54.374410871041164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models (MLLMs) have demonstrated strong capabilities on vision-and-language tasks. However, recent findings reveal an imbalance in their reasoning capabilities across visual and textual modalities. Specifically, current MLLMs often over-rely on textual cues while under-attending to visual content, resulting in suboptimal performance on tasks that require genuine visual reasoning. We refer to this phenomenon as the modality gap, defined as the performance disparity between text-centric and vision-centric inputs. In this paper, we analyze the modality gap through the lens of training recipes. We first show that existing training recipes tend to amplify this gap. Then, we systematically explore strategies to bridge it from two complementary perspectives: data and loss design. Our findings provide insights into developing training recipes that mitigate the modality gap and promote more balanced multimodal reasoning. Our code is publicly available at https://github.com/UCSB-NLP-Chang/Bridging-Modality-Gap.
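The abstract defines the modality gap as the performance disparity between text-centric and vision-centric inputs. A minimal sketch of one way to compute such a gap is given below. It assumes accuracy as the performance metric and a simple difference as the disparity; neither choice is confirmed by the abstract, and the function names `accuracy` and `modality_gap` are hypothetical, not taken from the authors' repository.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one example is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def modality_gap(
    text_predictions: Sequence[str],
    vision_predictions: Sequence[str],
    answers: Sequence[str],
) -> float:
    """Assumed definition: text-centric accuracy minus vision-centric accuracy.

    Both prediction lists answer the same questions, once with the problem
    presented as text and once with it presented as an image. A positive
    value means the model does better on the text-centric version.
    """
    return accuracy(text_predictions, answers) - accuracy(vision_predictions, answers)


# Toy data, invented purely for illustration.
answers = ["A", "B", "C", "D"]
text_predictions = ["A", "B", "C", "A"]    # 3 of 4 correct
vision_predictions = ["A", "B", "D", "A"]  # 2 of 4 correct
gap = modality_gap(text_predictions, vision_predictions, answers)
```

Under these assumptions, a gap near zero would indicate balanced reasoning across the two input forms, while a large positive gap would indicate over-reliance on textual cues.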
Related papers
- Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs [55.61018839017648]
Chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks.
Existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies.
We propose SAYO, a visual reasoning model trained with a reinforcement learning framework that introduces a region-level visual attention-based reward.
arXiv Detail & Related papers (2026-02-09T03:33:23Z)
- Monet: Reasoning in Latent Visual Space Beyond Images and Language [55.424507246294326]
"Thinking with images" has emerged as an effective paradigm for advancing visual reasoning.
Existing methods fall short of human-like abstract visual thinking.
We introduce Monet, a training framework that enables multimodal large language models to reason directly within the latent visual space.
arXiv Detail & Related papers (2025-11-26T13:46:39Z)
- What do vision-language models see in the context? Investigating multimodal in-context learning [2.1119217917006234]
In-context learning (ICL) enables Large Language Models to learn tasks from demonstration examples without parameter updates.
We present a systematic study of ICL in Vision-Language Models (VLMs).
We analyze how prompt design, architectural choices, and training strategies influence multimodal ICL.
arXiv Detail & Related papers (2025-10-28T11:55:24Z)
- True Multimodal In-Context Learning Needs Attention to the Visual Context [69.63677595066012]
Multimodal Large Language Models (MLLMs) have enabled Multimodal In-Context Learning (MICL), i.e., adapting to new tasks.
Current MLLMs tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation.
We introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context.
arXiv Detail & Related papers (2025-07-21T17:08:18Z)
- Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving [61.992824291296444]
Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs).
This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework.
arXiv Detail & Related papers (2025-05-23T08:18:00Z)
- Improving Fine-grained Visual Understanding in VLMs through Text-Only Training [0.0]
We investigate the feasibility of enhancing fine-grained visual understanding in Visual-Language Models (VLMs) through text-only training.
We conduct comprehensive experiments on two distinct domains: fine-grained species classification and cultural visual understanding tasks.
Our findings demonstrate that text-only training can be comparable to conventional image-text training while significantly reducing computational costs.
arXiv Detail & Related papers (2024-12-17T14:18:50Z)
- Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG).
arXiv Detail & Related papers (2024-11-21T16:33:30Z)
- Improving Visual Commonsense in Language Models via Multiple Image Generation [41.565399860320966]
Existing large language models (LLMs) are primarily trained using textual data only.
Visual Language Models, which excel at visually-oriented tasks, often fail at non-visual tasks such as basic commonsense reasoning.
This divergence highlights a critical challenge - the integration of robust visual understanding with foundational text-based language reasoning.
arXiv Detail & Related papers (2024-06-19T15:17:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.