Related papers: Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Discern Causal Links Across Modalities

Related papers

Rethinking the Text-Vision Reasoning Imbalance in MLLMs through the Lens of Training Recipes [54.374410871041164]
Multimodal large language models (MLLMs) have demonstrated strong capabilities on vision-and-language tasks. Recent findings reveal an imbalance in their reasoning capabilities across visual and textual modalities. We refer to this phenomenon as the modality gap, defined as the performance disparity between text-centric and vision-centric inputs.
arXiv Detail & Related papers (2025-10-26T21:06:13Z)
Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space [66.76138204796497]
Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. We propose Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects both visual and textual information in the reasoning process within the latent space. Experiments on M3CoT and ScienceQA demonstrate that our IVT-LR method achieves an average performance increase of 5.45% in accuracy, while simultaneously achieving a speed increase of over 5 times compared to existing approaches.
arXiv Detail & Related papers (2025-10-14T14:58:25Z)
Advancing Multimodal Reasoning Capabilities of Multimodal Large Language Models via Visual Perception Reward [87.06604760273372]
We propose Perception-R1, which introduces a novel visual perception reward that explicitly encourages MLLMs to perceive the visual content accurately. We show that Perception-R1 achieves state-of-the-art performance on most benchmarks using only 1,442 training samples.
arXiv Detail & Related papers (2025-06-08T16:48:42Z)
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories. These types of questions can be evaluated to assess the visual reasoning capabilities of MLLMs from multiple perspectives. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z)
Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages. TVC helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z)
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity [34.29409506366145]
VERIFY is a benchmark designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs. Each problem is accompanied by a human-annotated reasoning path, making it the first to provide in-depth evaluation of model decision-making processes. We propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns.
arXiv Detail & Related papers (2025-03-14T16:26:11Z)
VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models [62.667142971664575]
We introduce VisFactor, a novel benchmark derived from the Factor-Referenced Cognitive Test (FRCT). VisFactor digitalizes vision-related FRCT subtests to systematically evaluate MLLMs across essential visual cognitive tasks. We present a comprehensive evaluation of state-of-the-art MLLMs, such as GPT-4o, Gemini-Pro, and Qwen-VL.
arXiv Detail & Related papers (2025-02-23T04:21:32Z)
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [70.74453180101365]
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). We propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces.
arXiv Detail & Related papers (2025-01-13T18:23:57Z)
CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models [60.08485416687596]
The Chain of Multi-modal Thought (CoMT) benchmark aims to mimic human-like reasoning that inherently integrates visual operation. We evaluate various LVLMs and strategies on CoMT, revealing some key insights into the capabilities and limitations of the current approaches.
arXiv Detail & Related papers (2024-12-17T14:10:16Z)
Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination [13.706325901731665]
Multimodal large language models (MLLMs) have advanced the integration of visual and linguistic modalities. Current approaches like chain of thought (CoT) reasoning have augmented the cognitive capabilities of large language models (LLMs), but their adaptation to MLLMs is hindered by heightened risks of hallucination in cross-modality comprehension.
arXiv Detail & Related papers (2024-11-15T21:01:37Z)
ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom [42.03770972100087]
We introduce a novel visual reasoning framework named ProReason. ProReason features multi-run proactive perception and decoupled vision-reasoning capabilities. Our experiments demonstrate that ProReason outperforms both existing multi-step reasoning frameworks and passive peer methods.
arXiv Detail & Related papers (2024-10-18T03:22:06Z)
Enhancing Advanced Visual Reasoning Ability of Large Language Models [20.32900494896848]
Recent advancements in Vision-Language (VL) research have sparked new benchmarks for complex visual reasoning. We propose Complex Visual Reasoning Large Language Models (CVR-LLM). Our approach transforms images into detailed, context-aware descriptions using an iterative self-refinement loop. We also introduce a novel multi-modal in-context learning (ICL) methodology to enhance LLMs' contextual understanding and reasoning.
arXiv Detail & Related papers (2024-09-21T02:10:19Z)
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning [15.296263261737026]
We introduce MIRB, a multi-image benchmark, to evaluate visual language models' ability to compare, analyze, and reason across multiple images. Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning. We demonstrate that while open-source VLMs have been shown to approach GPT-4V in single-image tasks, a significant gap remains in multi-image reasoning tasks.
arXiv Detail & Related papers (2024-06-18T16:02:18Z)
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z)
Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective [9.633811630889237]
We propose a causal framework to interpret the biases in Visual Question Answering (VQA) problems. We introduce a novel dataset, MORE, with 12,000 challenging VQA instances requiring multi-hop reasoning. Our experiments show that MLLMs perform poorly on MORE, indicating strong unimodal biases and limited semantic understanding.
arXiv Detail & Related papers (2024-03-27T08:38:49Z)
RelationVLM: Making Large Vision-Language Models Understand Visual Relations [66.70252936043688]
We present RelationVLM, a large vision-language model capable of comprehending various levels and types of relations whether across multiple images or within a video. Specifically, we devise a multi-stage relation-aware training scheme and a series of corresponding data configuration strategies to bestow RelationVLM with the capabilities of understanding semantic relations.
arXiv Detail & Related papers (2024-03-19T15:01:19Z)
NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models [34.91372939329467]
We introduce a benchmark, NPHardEval4V, to evaluate the pure reasoning abilities of MLLMs. Our findings reveal significant discrepancies in reasoning abilities across different models. We also investigate the impact of different prompting styles, including visual, text, and combined visual and text prompts, on the reasoning abilities of MLLMs.
arXiv Detail & Related papers (2024-03-04T07:10:31Z)
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension. Our findings indicate that MLLMs consistently fall short of human performance on this benchmark. This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [50.77984109941538]
Our research reveals that the visual capabilities in recent multimodal LLMs still exhibit systematic shortcomings. We identify "CLIP-blind pairs": images that CLIP perceives as similar despite their clear visual differences. We evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs.
arXiv Detail & Related papers (2024-01-11T18:58:36Z)
Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models [10.429515738164342]
Causal Context Generation, Causal-CoG, is a prompting strategy that engages contextual information to enhance precise VQA during inference. We run extensive experiments on 10 multimodal benchmarks and show consistent improvements. We hope Causal-CoG inspires explorations of context knowledge in multimodal models, and serves as a plug-and-play strategy for decoding.
arXiv Detail & Related papers (2023-12-09T08:44:41Z)
Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks. MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning. IPVR contains three stages: see, think, and confirm. We conduct experiments on a range of knowledge-based visual reasoning datasets.
arXiv Detail & Related papers (2023-01-12T18:59:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.