Related papers: Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?

Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?

URL: http://arxiv.org/abs/2501.02669v2
Date: Mon, 02 Jun 2025 16:48:29 GMT
Title: Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?
Authors: Simon Park, Abhishek Panigrahi, Yun Cheng, Dingli Yu, Anirudh Goyal, Sanjeev Arora,
Abstract summary: Vision Language Models (VLMs) are impressive at visual question answering and image captioning.<n>But they underperform on multi-step visual reasoning, giving rise to perceptions of modality imbalance or brittleness.<n>We introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning.
Score: 48.41029452721923
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision Language Models (VLMs) are impressive at visual question answering and image captioning. But they underperform on multi-step visual reasoning -- even compared to LLMs on the same tasks presented in text form -- giving rise to perceptions of modality imbalance or brittleness. Towards a systematic study of such issues, we introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning, comprising three tasks: Table Readout, Grid Navigation, and Visual Analogy. Each has two levels of difficulty, SIMPLE and HARD, and even the SIMPLE versions are difficult for frontier VLMs. We propose strategies for training on the SIMPLE version of tasks that improve performance on the corresponding HARD task, i.e., simple-to-hard (S2H) generalization. This controlled setup, where each task also has an equivalent text-only version, allows a quantification of the modality imbalance and how it is impacted by training strategy. We show that 1) explicit image-to-text conversion is important in promoting S2H generalization on images, by transferring reasoning from text; 2) conversion can be internalized at test time. We also report results of mechanistic study of this phenomenon. We identify measures of gradient alignment that can identify training strategies that promote better S2H generalization. Ablations highlight the importance of chain-of-thought.

Related papers

Seeing to Generalize: How Visual Data Corrects Binding Shortcuts [5.724899979571379]
Vision Language Models can outperform their underlying Large Language Models on purely text-only tasks.<n>We show that visual training changes the model's internal binding strategy.<n>Our findings suggest that cross-modal training can enhance reasoning and generalization even for tasks grounded in a single modality.
arXiv Detail & Related papers (2026-02-16T20:43:12Z)
Rethinking the Text-Vision Reasoning Imbalance in MLLMs through the Lens of Training Recipes [54.374410871041164]
Multimodal large language models (MLLMs) have demonstrated strong capabilities on vision-and-language tasks.<n>Recent findings reveal an imbalance in their reasoning capabilities across visual and textual modalities.<n>We refer to this phenomenon as the textitmodality gap, defined as the performance disparity between text-centric and vision-centric inputs.
arXiv Detail & Related papers (2025-10-26T21:06:13Z)
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions.<n>Models trained with the ViCrit Task exhibit substantial gains across a variety of vision-language models benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z)
Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning [20.632248864242968]
We show that language-only models can achieve comparable or even better performance than MLLMs that consume raw visual inputs.<n>Motivated by this, we propose a simple visual perturbation framework that enhances perceptual robustness without requiring algorithmic modifications.<n>Our findings highlight the critical role of visual perturbation in multimodal mathematical reasoning.
arXiv Detail & Related papers (2025-06-11T13:39:46Z)
Q-Adapt: Adapting LMM for Visual Quality Assessment with Progressive Instruction Tuning [49.07442840323135]
We propose a new paradigm for perception-oriented instruction tuning, i.e., Q-Adapt. Our proposed Q-Adapt can achieve a lightweight visual quality evaluator, demonstrating comparable performance.
arXiv Detail & Related papers (2025-04-02T12:02:57Z)
Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guidedd Visual Selection [53.558449071113245]
Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM) Recent advancements in vision-language modeling introduce image cropping techniques that feed all encoded sub-images into the model. We propose a lightweight, universal framework that seamlessly integrates with existing VLMs to enhance their ability to process finegrained details.
arXiv Detail & Related papers (2025-03-14T18:33:31Z)
VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding [94.64781599202882]
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks. They often struggle with visual arithmetic, seemingly simple capabilities like object counting or length comparison. We propose CogAlign, a novel post-training strategy inspired by Piaget's theory of cognitive development.
arXiv Detail & Related papers (2025-02-17T06:54:49Z)
Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses [31.85977999591524]
Vision-Language Models (VLMs) achieved strong performance on a variety of tasks (e.g., image-text retrieval, visual question answering)<n>We propose HIerarchically STructured Learning (HIST) that enhances VLM training without any additional supervision.
arXiv Detail & Related papers (2024-12-11T05:36:18Z)
Discriminative Fine-tuning of LVLMs [67.14293827774827]
Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning.<n>We propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs.
arXiv Detail & Related papers (2024-12-05T17:54:27Z)
Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG)
arXiv Detail & Related papers (2024-11-21T16:33:30Z)
Large Vision-Language Models as Emotion Recognizers in Context Awareness [14.85890824622433]
Context-aware emotion recognition (CAER) is a complex and significant task that requires perceiving emotions from various contextual cues. Previous approaches primarily focus on designing sophisticated architectures to extract emotional cues from images. This paper systematically explore the potential of leveraging Large Vision-Language Models (LVLMs) to empower the CAER task.
arXiv Detail & Related papers (2024-07-16T01:28:06Z)
Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training [24.989732666940153]
Open-source multimodal large language models (MLLMs) excel in various tasks involving textual and visual inputs. MLLMs still struggle with complex multimodal mathematical reasoning, lagging behind proprietary models like GPT-4V(ision) and Gemini-Pro. We propose a two-step training pipeline VCAR, which emphasizes the Visual Reasoning training in addition to mathematical learning.
arXiv Detail & Related papers (2024-04-22T21:59:35Z)
Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing. Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior to the input image. To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding [6.798129852396113]
We introduce a simple and effective method to improve compositional reasoning in Vision-Language Models (VLMs) Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework. When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines.
arXiv Detail & Related papers (2023-06-15T03:26:28Z)
Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem [60.0878532426877]
We propose a novel collaborative learning scheme from the viewpoint of visual perturbation calibration. Specifically, we devise a visual controller to construct two sorts of curated images with different perturbation extents. The experimental results on two diagnostic VQA-CP benchmark datasets evidently demonstrate its effectiveness.
arXiv Detail & Related papers (2022-07-24T23:50:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.