Related papers: GRIT: Teaching MLLMs to Think with Images

URL: http://arxiv.org/abs/2505.15879v1
Date: Wed, 21 May 2025 17:54:49 GMT
Title: GRIT: Teaching MLLMs to Think with Images
Authors: Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, Xin Eric Wang
Abstract summary: Grounded Reasoning with Images and Texts (GRIT) is a novel method for training MLLMs to think with images. GRIT generates reasoning chains that interleave natural language and explicit bounding box coordinates. GRIT achieves exceptional data efficiency, requiring as few as 20 image-question-answer triplets from existing datasets.
Score: 22.74533687444133
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent studies have demonstrated the efficacy of using Reinforcement Learning (RL) in building reasoning models that articulate chains of thoughts prior to producing final answers. However, despite ongoing advances that aim at enabling reasoning for vision-language tasks, existing open-source visual reasoning models typically generate reasoning content with pure natural language, lacking explicit integration of visual information. This limits their ability to produce clearly articulated and visually grounded reasoning chains. To this end, we propose Grounded Reasoning with Images and Texts (GRIT), a novel method for training MLLMs to think with images. GRIT introduces a grounded reasoning paradigm, in which models generate reasoning chains that interleave natural language and explicit bounding box coordinates. These coordinates point to regions of the input image that the model consults during its reasoning process. Additionally, GRIT is equipped with a reinforcement learning approach, GRPO-GR, built upon the GRPO algorithm. GRPO-GR employs robust rewards focused on the final answer accuracy and format of the grounded reasoning output, which eliminates the need for data with reasoning chain annotations or explicit bounding box labels. As a result, GRIT achieves exceptional data efficiency, requiring as few as 20 image-question-answer triplets from existing datasets. Comprehensive evaluations demonstrate that GRIT effectively trains MLLMs to produce coherent and visually grounded reasoning chains, showing a successful unification of reasoning and grounding abilities.
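
To make the abstract's description more concrete, the sketch below illustrates the general idea of a reasoning chain that interleaves text with bounding boxes, scored by a format reward and an answer reward, with rewards normalized within a sampled group as in GRPO. It is only an illustration under assumptions: the `[x1, y1, x2, y2]` box syntax, the function names, the exact-match answer check, and the reward values are all hypothetical and are not taken from the GRIT paper.

```python
import re

# Assumed (hypothetical) box syntax: [x1, y1, x2, y2] with integer pixel coordinates.
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def extract_boxes(reasoning):
    """Return every (x1, y1, x2, y2) tuple found in a reasoning string."""
    return [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(reasoning)]


def is_valid_box(box, width, height):
    """A box is valid if it has positive area and lies inside the image."""
    x1, y1, x2, y2 = box
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height


def format_reward(reasoning, width, height):
    """Hypothetical format reward: 1.0 if the chain contains at least one box
    and every box is valid for the image size, otherwise 0.0."""
    boxes = extract_boxes(reasoning)
    if not boxes:
        return 0.0
    return 1.0 if all(is_valid_box(b, width, height) for b in boxes) else 0.0


def answer_reward(predicted, reference):
    """Hypothetical accuracy reward: case-insensitive exact match."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards):
    """GRPO-style normalization: subtract the group mean, divide by the group std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]


chain = "The sign is at [10, 20, 110, 80], and the arrow at [200, 40, 260, 90] points left."
total = format_reward(chain, 640, 480) + answer_reward("Left", "left")
```

Neither reward needs annotated reasoning chains or ground-truth boxes: the format check looks only at the model's own output, and the answer check needs only the reference answer, which is consistent with the data-efficiency claim in the abstract.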

Related papers

Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering [75.12322966980003]
Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains. Most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering. We propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA.
arXiv Detail & Related papers (2025-06-11T12:03:52Z)
Knowledge Retrieval in LLM Gaming: A Shift from Entity-Centric to Goal-Oriented Graphs [6.636092764694501]
Large Language Models (LLMs) demonstrate impressive general capabilities but often struggle with step-by-step reasoning, especially in complex applications such as games. We propose a novel framework based on Goal-Oriented Graphs (GoGs), where each node represents a goal and its associated attributes, and edges encode logical dependencies between goals. Our method significantly enhances the reasoning ability of LLMs in game-playing tasks, as demonstrated by extensive experiments on the Minecraft testbed, outperforming GraphRAG and other baselines.
arXiv Detail & Related papers (2025-05-24T09:09:20Z)
Align-GRAG: Reasoning-Guided Dual Alignment for Graph Retrieval-Augmented Generation [75.9865035064794]
Large language models (LLMs) have demonstrated remarkable capabilities, but still struggle with issues like hallucinations and outdated information. Retrieval-augmented generation (RAG) addresses these issues by grounding LLM outputs in external knowledge with an Information Retrieval (IR) system. We propose Align-GRAG, a novel reasoning-guided dual alignment framework in the post-retrieval phase.
arXiv Detail & Related papers (2025-05-22T05:15:27Z)
VLM-R$^3$: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought [51.43082554363725]
We introduce VLM-R$^3$ (Visual Language Model with Region Recognition and Reasoning), a framework that equips an MLLM with the ability to decide when additional visual evidence is needed. Experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R$^3$ sets a new
arXiv Detail & Related papers (2025-05-22T03:50:13Z)
Causal Graphs Meet Thoughts: Enhancing Complex Reasoning in Graph-Augmented LLMs [4.701165676405066]
It is critical not only to retrieve relevant information but also to provide causal reasoning and explainability. This paper proposes a novel pipeline that filters large knowledge graphs to emphasize cause-effect edges. Experiments on medical question-answering tasks show consistent gains, with up to a 10% absolute improvement.
arXiv Detail & Related papers (2025-01-24T19:31:06Z)
Reasoning with Graphs: Structuring Implicit Knowledge to Enhance LLMs Reasoning [73.2950349728376]
Large language models (LLMs) have demonstrated remarkable success across a wide range of tasks. However, they still encounter challenges in reasoning tasks that require understanding and inferring relationships between pieces of information. This challenge is particularly pronounced in tasks involving multi-step processes, such as logical reasoning and multi-hop question answering. We propose Reasoning with Graphs (RwG) by first constructing explicit graphs from the context.
arXiv Detail & Related papers (2025-01-14T05:18:20Z)
Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z)
Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models [83.28737898989694]
Large language models (LLMs) struggle with faithful reasoning due to knowledge gaps and hallucinations. We introduce graph-constrained reasoning (GCR), a novel framework that bridges structured knowledge in KGs with unstructured reasoning in LLMs. GCR achieves state-of-the-art performance and exhibits strong zero-shot generalizability to unseen KGs without additional training.
arXiv Detail & Related papers (2024-10-16T22:55:17Z)
Debate on Graph: a Flexible and Reliable Reasoning Framework for Large Language Models [33.662269036173456]
Large Language Models (LLMs) may suffer from hallucinations in real-world applications due to the lack of relevant knowledge. Knowledge Graph Question Answering (KGQA) serves as a critical touchstone for the integration. We propose an interactive KGQA framework that leverages the interactive learning capabilities of LLMs to perform reasoning and Debating over Graphs (DoG).
arXiv Detail & Related papers (2024-09-05T01:11:58Z)
Dual Reasoning: A GNN-LLM Collaborative Framework for Knowledge Graph Question Answering [38.31983923708175]
We propose Dual-Reasoning, a novel framework that integrates an external system based on Graph Neural Network (GNN) for explicit reasoning on Knowledge Graphs (KGs). We show that DualR achieves state-of-the-art performance while maintaining high efficiency and interpretability.
arXiv Detail & Related papers (2024-06-03T09:38:28Z)
Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning [104.92384929827776]
Large language models (LLMs) have demonstrated impressive reasoning abilities in complex tasks. They lack up-to-date knowledge and experience hallucinations during reasoning. Knowledge graphs (KGs) offer a reliable source of knowledge for reasoning.
arXiv Detail & Related papers (2023-10-02T10:14:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.