Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering
- URL: http://arxiv.org/abs/2506.00806v1
- Date: Sun, 01 Jun 2025 03:15:29 GMT
- Title: Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering
- Authors: Songtao Jiang, Chenyi Zhou, Yan Zhang, Yeying Jin, Zuozhu Liu,
- Abstract summary: Multimodal large language models (MLLMs) still struggle with complex reasoning tasks in Visual Question Answering.<n>We propose FOCUS, a plug-and-play approach that dynamically adapts to the complexity of questions.<n>Experiments on four benchmarks, ScienceQA, TextQA, VizWiz, and MME, demonstrate that FOCUS consistently improves the performance of both open-source and black-box MLLMs.
- Score: 11.271123465926301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) still struggle with complex reasoning tasks in Visual Question Answering (VQA). While current methods have advanced by incorporating visual prompts, our study uncovers critical limitations: these approaches indiscriminately annotate all detected objects for every visual question, generating excessive visual markers that degrade task performance. This issue stems primarily from a lack of focus on key visual elements, raising two important questions: Are all objects equally important, and do all questions require visual prompts? Motivated by Dual Process Theory, which distinguishes between instinctive and deliberate cognitive modes in human reasoning, we propose FOCUS, a plug-and-play approach that dynamically adapts to the complexity of questions, combining fast intuitive judgments with deliberate analytical reasoning to enhance the vision-language reasoning capability of the MLLM. For straightforward questions, FOCUS supports efficient zero-shot reasoning. For more complex tasks, it employs the conceptualizing before observation strategy to highlight critical elements. Extensive experiments on four benchmarks, ScienceQA, TextQA, VizWiz, and MME, demonstrate that FOCUS consistently improves the performance of both open-source and black-box MLLMs, achieving significant gains across all datasets. Ablation studies further validate the importance of combining diverse cognitive strategies with refined visual information for superior performance. Code will be released.
Related papers
- Toward Cognitive Supersensing in Multimodal Large Language Model [67.15559571626747]
We introduce Cognitive Supersensing, a training paradigm that endows MLLMs with human-like visual imagery capabilities.<n>In experiments, MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench.<n>We will open-source the CogSense-Bench and our model weights.
arXiv Detail & Related papers (2026-02-02T02:19:50Z) - Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models [63.69856480318313]
AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment.<n>We show that AGILE substantially boosts performance on jigsaw tasks of varying complexity.<n>We also demonstrate strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%.
arXiv Detail & Related papers (2025-10-01T17:58:05Z) - Training-free Uncertainty Guidance for Complex Visual Tasks with MLLMs [61.64185573373394]
We propose a training-free framework that uses an MLLM's intrinsic uncertainty as a proactive guidance signal.<n>We introduce a unified mechanism that scores candidate visual inputs by response uncertainty, enabling the model to autonomously focus on the most salient data.<n>Our work validates that harnessing intrinsic uncertainty is a powerful, general strategy for enhancing fine-grained multimodal performance.
arXiv Detail & Related papers (2025-10-01T09:20:51Z) - Perceive, Reflect and Understand Long Video: Progressive Multi-Granular Clue Exploration with Interactive Agents [60.095739427926524]
Long videos, characterized by temporal and sparse task-relevant information, pose significant reasoning challenges for AI systems.<n>Inspired by human progressive visual cognition, we propose CogniGPT for efficient and reliable long video understanding.
arXiv Detail & Related papers (2025-09-29T15:42:55Z) - Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning [79.34909830834464]
Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments.<n>We show that visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance.<n>We propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level.
arXiv Detail & Related papers (2025-09-08T09:20:04Z) - MPCAR: Multi-Perspective Contextual Augmentation for Enhanced Visual Reasoning in Large Vision-Language Models [7.702194892874595]
Multi-Perspective Contextual Augmentation for Reasoning (MPCAR) is a novel inference-time strategy designed to enhance Large Vision-Language Models (LVLMs)<n>MPCAR operates in three stages: first, an LVLM generates N diverse and complementary descriptions or preliminary reasoning paths from various angles; second, these descriptions are intelligently integrated with the original question to construct a comprehensive context-augmented prompt; and finally, this enriched prompt guides the ultimate LVLM for deep reasoning and final answer generation.
arXiv Detail & Related papers (2025-08-17T15:25:01Z) - Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models [26.14137626882127]
Large Multimodal Models (LMMs) have recently demonstrated remarkable visual understanding performance on both vision-language and vision-centric tasks.<n>We present a unified visual reasoning mechanism that enables LMMs to solve complicated compositional problems.<n>Our trained model, Griffon-R, has the ability of end-to-end automatic understanding, self-thinking, and reasoning answers.
arXiv Detail & Related papers (2025-05-27T05:50:25Z) - Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs.<n>ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates.<n> Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z) - Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency [29.28977802424541]
We introduce VCBENCH, a benchmark for multimodal mathematical reasoning with explicit visual dependencies.<n> VCBENCH includes 1,720 problems across six cognitive domains, featuring 6,697 images (averaging 3.9 per question) to ensure multi-image reasoning.<n>We evaluate 26 state-of-the-art LVLMs on VCBENCH, revealing substantial performance disparities, with even the top models unable to exceed 50% accuracy.
arXiv Detail & Related papers (2025-04-24T06:16:38Z) - VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.<n>We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.<n>We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z) - VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models [62.667142971664575]
We introduce VisFactor, a novel benchmark derived from the Factor-Referenced Cognitive Test (FRCT)<n>VisFactor digitalizes vision-related FRCT subtests to systematically evaluate MLLMs across essential visual cognitive tasks.<n>We present a comprehensive evaluation of state-of-the-art MLLMs, such as GPT-4o, Gemini-Pro, and Qwen-VL.
arXiv Detail & Related papers (2025-02-23T04:21:32Z) - Intriguing Properties of Large Language and Vision Models [18.449076451976236]
Large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance.
Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks remains surprisingly low.
We investigate this question by evaluating the most common LLVM's families (i.e., LLaVA) across 10 evaluation benchmarks.
arXiv Detail & Related papers (2024-10-07T05:07:01Z) - BloomWise: Enhancing Problem-Solving capabilities of Large Language Models using Bloom's-Taxonomy-Inspired Prompts [59.83547898874152]
BloomWise is a cognitively-inspired prompting technique for large language models (LLMs)<n>It is designed to enhance LLMs' performance on mathematical problem solving while making their solutions more explainable.
arXiv Detail & Related papers (2024-10-05T09:27:52Z) - Explore the Hallucination on Low-level Perception for MLLMs [83.12180878559295]
We aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks.
We present QL-Bench, a benchmark settings to simulate human responses to low-level vision.
We demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped.
arXiv Detail & Related papers (2024-09-15T14:38:29Z) - Visual Agents as Fast and Slow Thinkers [88.1404921693082]
We introduce FaST, which incorporates the Fast and Slow Thinking mechanism into visual agents.<n>FaST employs a switch adapter to dynamically select between System 1/2 modes.<n>It tackles uncertain and unseen objects by adjusting model confidence and integrating new contextual data.
arXiv Detail & Related papers (2024-08-16T17:44:02Z) - Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models [14.765057045747753]
Chain-of-Thought (CoT) and related rationale-based works have significantly improved the performance of Large Language Models (LLMs) in complex reasoning tasks.
We propose the Image-of-Thought (IoT) prompting method, which helps MLLMs to extract visual rationales step-by-step.
IoT prompting has improved zero-shot visual reasoning performance across various visual understanding tasks in different MLLMs.
arXiv Detail & Related papers (2024-05-22T17:56:51Z) - Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z) - LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.