Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models
- URL: http://arxiv.org/abs/2505.20753v1
- Date: Tue, 27 May 2025 05:50:25 GMT
- Title: Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models
- Authors: Yufei Zhan, Hongyin Zhao, Yousong Zhu, Shurong Zheng, Fan Yang, Ming Tang, Jinqiao Wang
- Abstract summary: Large Multimodal Models (LMMs) have recently demonstrated remarkable visual understanding performance on both vision-language and vision-centric tasks. We present a unified visual reasoning mechanism that enables LMMs to solve complicated compositional problems. Our trained model, Griffon-R, performs end-to-end automatic understanding, self-thinking, and reasoned answering.
- Score: 26.14137626882127
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Multimodal Models (LMMs) have recently demonstrated remarkable visual understanding performance on both vision-language and vision-centric tasks. However, they often fall short in integrating advanced, task-specific capabilities for compositional reasoning, which hinders their progress toward truly competent general vision models. To address this, we present a unified visual reasoning mechanism that enables LMMs to solve complicated compositional problems by leveraging their intrinsic capabilities (e.g., grounding and visual understanding). Unlike the previous shortcut-learning mechanism, our approach introduces a human-like understanding-thinking-answering process, allowing the model to complete all steps in a single forward pass without multiple inferences or external tools. This design bridges the gap between foundational visual capabilities and general question answering, encouraging LMMs to generate faithful and traceable responses for complex visual reasoning. Meanwhile, we curate 334K visual instruction samples covering both general and text-rich scenes and involving multiple foundational visual capabilities. Our trained model, Griffon-R, performs end-to-end automatic understanding, self-thinking, and reasoned answering. Comprehensive experiments show that Griffon-R not only advances performance on complex visual reasoning benchmarks, including VSR and CLEVR, but also improves multimodal capabilities across benchmarks such as MMBench and ScienceQA. Data, models, and code will be released at https://github.com/jefferyZhan/Griffon/tree/master/Griffon-R soon.
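The abstract describes the understanding-thinking-answering response being produced in a single forward pass rather than through repeated inference or external tools. Below is a minimal, hypothetical Python sketch of how such a staged response might be prompted for and then parsed; the stage markers, the prompt wording, and the `generate(image, prompt)` callable are illustrative assumptions, not the released Griffon-R interface.

```python
# Minimal sketch (not the released Griffon-R code) of a single-pass
# understanding-thinking-answering response format. The section markers and
# the `generate` callable are assumptions for illustration only.

def build_prompt(question: str) -> str:
    # Ask the model to emit all three stages in one generation,
    # instead of jumping straight to the answer (shortcut learning).
    return (
        f"Question: {question}\n"
        "First describe the relevant image content (Understanding), "
        "then reason step by step (Thinking), then give the final Answer."
    )

def parse_response(text: str) -> dict:
    """Split one generated string into the three assumed stages.

    Assumes the model emits the literal markers 'Understanding:',
    'Thinking:', and 'Answer:' in that order.
    """
    result, current = {}, None
    for line in text.splitlines():
        for key in ("Understanding:", "Thinking:", "Answer:"):
            if line.startswith(key):
                current = key.rstrip(":").lower()
                line = line[len(key):].strip()
        if current:
            result[current] = (result.get(current, "") + " " + line).strip()
    return result

# Usage with a hypothetical multimodal `generate(image, prompt)` callable:
# raw = generate(image, build_prompt("Is the mug left of the laptop?"))
# stages = parse_response(raw)
# print(stages["thinking"], "->", stages["answer"])
```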
Related papers
- Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities [54.94982467313341]
Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. We set out to understand the limitations of SoTA VLMs on fundamental visual tasks by constructing a series of tests that probe which design components, specifically, may be lacking.
arXiv Detail & Related papers (2025-07-10T15:26:41Z) - Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering [11.271123465926301]
Multimodal large language models (MLLMs) still struggle with complex reasoning tasks in Visual Question Answering. We propose FOCUS, a plug-and-play approach that dynamically adapts to the complexity of questions. Experiments on four benchmarks, ScienceQA, TextQA, VizWiz, and MME, demonstrate that FOCUS consistently improves the performance of both open-source and black-box MLLMs (a hedged sketch of this fast-or-slow routing appears after this list).
arXiv Detail & Related papers (2025-06-01T03:15:29Z) - VoQA: Visual-only Question Answering [7.251596370310251]
We propose Visual-only Question Answering (VoQA), a novel multimodal task in which questions are visually embedded within images. This requires models to locate, recognize, and reason over visually embedded textual questions. We introduce Guided Response Triggering Supervised Fine-tuning (GRT-SFT), a structured fine-tuning strategy that guides the model to perform step-by-step reasoning based purely on visual input.
arXiv Detail & Related papers (2025-05-20T11:37:49Z) - Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning [125.79428219851289]
Inst-IT is a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm.
arXiv Detail & Related papers (2024-12-04T18:58:10Z) - Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also lacking some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models [14.765057045747753]
Chain-of-Thought (CoT) and related rationale-based works have significantly improved the performance of Large Language Models (LLMs) in complex reasoning tasks.
We propose the Image-of-Thought (IoT) prompting method, which helps MLLMs to extract visual rationales step-by-step.
IoT prompting has improved zero-shot visual reasoning performance across various visual understanding tasks in different MLLMs.
arXiv Detail & Related papers (2024-05-22T17:56:51Z) - Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z) - Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models [87.47400128150032]
We propose a novel LMM architecture named Lumen, a Large multimodal model with versatile vision-centric capability enhancement.
Lumen first promotes fine-grained vision-language concept alignment.
Then the task-specific decoding is carried out by flexibly routing the shared representation to lightweight task decoders.
arXiv Detail & Related papers (2024-03-12T04:13:45Z) - De-fine: Decomposing and Refining Visual Programs with Auto-Feedback [75.62712247421146]
De-fine is a training-free framework that decomposes complex tasks into simpler subtasks and refines programs through auto-feedback.
Our experiments across various visual tasks show that De-fine creates more robust programs.
arXiv Detail & Related papers (2023-11-21T06:24:09Z) - Look, Remember and Reason: Grounded reasoning in videos with language models [5.3445140425713245]
Multimodal language models (LMs) have recently shown promising performance in high-level reasoning tasks on videos.
We propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, and tracking, to endow the model with the required low-level visual capabilities.
We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else and STAR datasets.
arXiv Detail & Related papers (2023-06-30T16:31:14Z)
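For the fast-or-slow idea summarized in the FOCUS entry above, the following is a rough, hypothetical Python sketch of routing a question either to a direct answer or to an explicit step-by-step pass based on an estimated complexity score; the heuristic, the threshold, and the `ask_mllm(image, question)` callable are illustrative assumptions rather than the paper's method.

```python
# Minimal sketch (not the FOCUS implementation) of routing a question to a
# fast, direct answer or a slower chain-of-thought pass based on an
# estimated complexity score. `ask_mllm` is a hypothetical callable.

def estimate_complexity(question: str) -> float:
    # Toy heuristic: multi-step cues and question length raise the score.
    cues = ("how many", "compare", "why", "order", "difference")
    score = sum(cue in question.lower() for cue in cues)
    return score + len(question.split()) / 50.0

def answer(image, question: str, ask_mllm, threshold: float = 1.0) -> str:
    if estimate_complexity(question) < threshold:
        # Fast path: answer directly.
        return ask_mllm(image, question)
    # Slow path: elicit explicit intermediate reasoning first.
    return ask_mllm(image, f"{question}\nThink step by step before answering.")
```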