DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models
- URL: http://arxiv.org/abs/2512.24165v1
- Date: Tue, 30 Dec 2025 11:51:18 GMT
- Title: DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models
- Authors: Zefeng He, Xiaoye Qu, Yafu Li, Tong Zhu, Siyuan Huang, Yu Cheng
- Abstract summary: We introduce DiffThinker, a generative multimodal reasoning framework. We show it can achieve superior logical consistency and spatial precision in vision-centric tasks.
- Score: 40.38351627330629
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While recent Multimodal Large Language Models (MLLMs) have made significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance in complex, long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm and revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) demonstrate that DiffThinker significantly outperforms leading closed-source models, including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.
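To make the paradigm concrete, below is a minimal sketch of what "reasoning as a native image-to-image generation task" could look like in practice, using the Hugging Face diffusers library. The checkpoint id, task prompt, and maze example are hypothetical illustrations only; the abstract does not specify DiffThinker's architecture, training recipe, or API, so this is not the authors' released code.

```python
# Hypothetical sketch: an instruction-conditioned image-to-image diffusion
# model fine-tuned to map a task image (e.g., an unsolved maze) directly to
# a solution image (the maze with the path drawn in), in the spirit of the
# generative multimodal reasoning paradigm described above.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "your-org/diffthinker-style-checkpoint",  # placeholder, not a real model id
    torch_dtype=torch.float16,
).to("cuda")

task_image = Image.open("maze.png").convert("RGB")

# A single denoising pass emits the entire solution image at once, rather
# than a token-by-token textual chain of thought.
solution = pipe(
    prompt="Draw the shortest path from start to goal.",
    image=task_image,
    num_inference_steps=20,
).images[0]

solution.save("maze_solution.png")
```

In a sketch like this, the "native parallelism" the abstract mentions would correspond to batching several task images through a single pipeline call, and "controllability" to conditioning the generation on different task prompts.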
Related papers
- PRISM: A Principled Framework for Multi-Agent Reasoning via Gain Decomposition [42.31805270016533]
Multi-agent collaboration has emerged as a promising paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). Existing approaches lack principled guidance on what drives performance gains and how to systematically optimize multi-agent reasoning. We introduce a unified theoretical framework that decomposes multi-agent reasoning gains into three conceptually independent dimensions.
arXiv Detail & Related papers (2026-02-09T12:24:56Z) - ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning [76.95203056566191]
Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We build ThinkMorph, a unified model fine-tuned on approximately 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic.
arXiv Detail & Related papers (2025-10-30T17:51:38Z) - Simple o3: Towards Interleaved Vision-Language Reasoning [38.46230601239066]
We propose Simple o3, an end-to-end framework that integrates dynamic tool interactions into interleaved vision-language reasoning. Our approach features a scalable data synthesis pipeline that generates high-quality interleaved vision-language reasoning chains. Experimental results demonstrate Simple o3's superior performance on diverse benchmarks, outperforming existing approaches.
arXiv Detail & Related papers (2025-08-16T17:15:39Z) - GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking [35.14983424309319]
We present GThinker, a novel reasoning MLLM excelling in multimodal reasoning across general scenarios, mathematics, and science. GThinker introduces Cue-Rethinking, a flexible reasoning pattern that grounds inferences in visual cues and iteratively reinterprets these cues to resolve inconsistencies. To support training, we construct GThinker-11K, comprising 7K high-quality, iteratively annotated reasoning paths and 4K curated reinforcement learning samples.
arXiv Detail & Related papers (2025-06-01T16:28:26Z) - Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models [45.15161506154318]
Infi-MMR is a framework to systematically unlock the reasoning potential of Multimodal Small Language Models. The first phase, Foundational Reasoning Activation, leverages high-quality textual reasoning datasets to activate and strengthen the model's logical reasoning capabilities. The second phase, Cross-Modal Reasoning Adaptation, utilizes caption-augmented multimodal data to facilitate the progressive transfer of reasoning skills to multimodal contexts. The third phase, Multimodal Reasoning Enhancement, employs curated, caption-free multimodal data to mitigate linguistic biases and promote robust cross-modal reasoning.
arXiv Detail & Related papers (2025-05-29T04:51:56Z) - Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models [79.52467430114805]
Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities.
arXiv Detail & Related papers (2025-05-08T03:35:23Z) - Visualizing Thought: Conceptual Diagrams Enable Robust Planning in LMMs [59.66595230543127]
Conceptual diagrams externalize mental models, abstracting irrelevant details to efficiently capture how entities interact. Large Language Models (LLMs) and Large Multimodal Models (LMMs) predominantly reason through text. We propose Visual Thinking, a generalizable framework that enables LMMs to reason through multiple chains of self-generated conceptual diagrams.
arXiv Detail & Related papers (2025-03-14T18:27:02Z) - Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [70.74453180101365]
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). We propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces.
arXiv Detail & Related papers (2025-01-13T18:23:57Z) - Progressive Multimodal Reasoning via Active Retrieval [64.74746997923967]
Multi-step multimodal reasoning tasks pose significant challenges for Multimodal Large Language Models (MLLMs). We propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs. We show that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
arXiv Detail & Related papers (2024-12-19T13:25:39Z) - Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.