Related papers: Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering

Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering

URL: http://arxiv.org/abs/2506.00806v1
Date: Sun, 01 Jun 2025 03:15:29 GMT
Title: Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering
Authors: Songtao Jiang, Chenyi Zhou, Yan Zhang, Yeying Jin, Zuozhu Liu,
Abstract summary: Multimodal large language models (MLLMs) still struggle with complex reasoning tasks in Visual Question Answering.<n>We propose FOCUS, a plug-and-play approach that dynamically adapts to the complexity of questions.<n>Experiments on four benchmarks, ScienceQA, TextQA, VizWiz, and MME, demonstrate that FOCUS consistently improves the performance of both open-source and black-box MLLMs.
Score: 11.271123465926301
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) still struggle with complex reasoning tasks in Visual Question Answering (VQA). While current methods have advanced by incorporating visual prompts, our study uncovers critical limitations: these approaches indiscriminately annotate all detected objects for every visual question, generating excessive visual markers that degrade task performance. This issue stems primarily from a lack of focus on key visual elements, raising two important questions: Are all objects equally important, and do all questions require visual prompts? Motivated by Dual Process Theory, which distinguishes between instinctive and deliberate cognitive modes in human reasoning, we propose FOCUS, a plug-and-play approach that dynamically adapts to the complexity of questions, combining fast intuitive judgments with deliberate analytical reasoning to enhance the vision-language reasoning capability of the MLLM. For straightforward questions, FOCUS supports efficient zero-shot reasoning. For more complex tasks, it employs the conceptualizing before observation strategy to highlight critical elements. Extensive experiments on four benchmarks, ScienceQA, TextQA, VizWiz, and MME, demonstrate that FOCUS consistently improves the performance of both open-source and black-box MLLMs, achieving significant gains across all datasets. Ablation studies further validate the importance of combining diverse cognitive strategies with refined visual information for superior performance. Code will be released.

Related papers

Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models [26.14137626882127]
Large Multimodal Models (LMMs) have recently demonstrated remarkable visual understanding performance on both vision-language and vision-centric tasks.<n>We present a unified visual reasoning mechanism that enables LMMs to solve complicated compositional problems.<n>Our trained model, Griffon-R, has the ability of end-to-end automatic understanding, self-thinking, and reasoning answers.
arXiv Detail & Related papers (2025-05-27T05:50:25Z)
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs.<n>ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates.<n> Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z)
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency [29.28977802424541]
We introduce VCBENCH, a benchmark for multimodal mathematical reasoning with explicit visual dependencies.<n> VCBENCH includes 1,720 problems across six cognitive domains, featuring 6,697 images (averaging 3.9 per question) to ensure multi-image reasoning.<n>We evaluate 26 state-of-the-art LVLMs on VCBENCH, revealing substantial performance disparities, with even the top models unable to exceed 50% accuracy.
arXiv Detail & Related papers (2025-04-24T06:16:38Z)
VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.<n>We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.<n>We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models [62.667142971664575]
We introduce VisFactor, a novel benchmark derived from the Factor-Referenced Cognitive Test (FRCT)<n>VisFactor digitalizes vision-related FRCT subtests to systematically evaluate MLLMs across essential visual cognitive tasks.<n>We present a comprehensive evaluation of state-of-the-art MLLMs, such as GPT-4o, Gemini-Pro, and Qwen-VL.
arXiv Detail & Related papers (2025-02-23T04:21:32Z)
Intriguing Properties of Large Language and Vision Models [18.449076451976236]
Large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance. Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks remains surprisingly low. We investigate this question by evaluating the most common LLVM's families (i.e., LLaVA) across 10 evaluation benchmarks.
arXiv Detail & Related papers (2024-10-07T05:07:01Z)
BloomWise: Enhancing Problem-Solving capabilities of Large Language Models using Bloom's-Taxonomy-Inspired Prompts [59.83547898874152]
BloomWise is a cognitively-inspired prompting technique for large language models (LLMs)<n>It is designed to enhance LLMs' performance on mathematical problem solving while making their solutions more explainable.
arXiv Detail & Related papers (2024-10-05T09:27:52Z)
Explore the Hallucination on Low-level Perception for MLLMs [83.12180878559295]
We aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks. We present QL-Bench, a benchmark settings to simulate human responses to low-level vision. We demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped.
arXiv Detail & Related papers (2024-09-15T14:38:29Z)
Visual Agents as Fast and Slow Thinkers [88.1404921693082]
We introduce FaST, which incorporates the Fast and Slow Thinking mechanism into visual agents.<n>FaST employs a switch adapter to dynamically select between System 1/2 modes.<n>It tackles uncertain and unseen objects by adjusting model confidence and integrating new contextual data.
arXiv Detail & Related papers (2024-08-16T17:44:02Z)
Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models [14.765057045747753]
Chain-of-Thought (CoT) and related rationale-based works have significantly improved the performance of Large Language Models (LLMs) in complex reasoning tasks. We propose the Image-of-Thought (IoT) prompting method, which helps MLLMs to extract visual rationales step-by-step. IoT prompting has improved zero-shot visual reasoning performance across various visual understanding tasks in different MLLMs.
arXiv Detail & Related papers (2024-05-22T17:56:51Z)
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z)
LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images. We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information. Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.