Learning to Think Fast and Slow for Visual Language Models
- URL: http://arxiv.org/abs/2511.16670v1
- Date: Thu, 20 Nov 2025 18:59:48 GMT
- Title: Learning to Think Fast and Slow for Visual Language Models
- Authors: Chenyu Lin, Cheng Chi, Jinlin Wu, Sharon Li, Kaiyang Zhou
- Abstract summary: We propose a simple RL approach, which enables visual language models to switch between fast and slow thinking modes depending on task difficulty. Our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models.
- Score: 29.91277432114863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to efficiently allocate cognitive resources, enabling quick decision-making for straightforward issues while reserving deeper analytical thinking for more intricate challenges. However, existing reasoning-oriented visual language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based RL rewards, mainly pursue lengthy, detailed reasoning chains, which often lead to excessive computational costs. In this work, we propose a simple RL approach, which enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages: in the first stage, we label data as either requiring fast thinking or slow thinking based on the model output length, which is inspired by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions; in the second stage, we train the model using GRPO along with the thinking mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.
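The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of the mean sampled response length, and the `threshold` value are all assumptions made for illustration, since the abstract only states that labels are derived from model output length and that training uses GRPO. The second function shows the standard group-relative advantage that GRPO computes within one group of sampled responses; the population standard deviation is used here as a simplification.

```python
from statistics import mean, pstdev


def label_thinking_mode(response_lengths, threshold):
    """Stage 1 (sketch): label a question 'fast' or 'slow'.

    response_lengths: token counts of answers sampled from the base model
    for one question. Averaging them and comparing against a fixed
    threshold is an assumption, not a detail given in the abstract.
    """
    return "slow" if mean(response_lengths) > threshold else "fast"


def group_relative_advantages(rewards, eps=1e-6):
    """Stage 2 (sketch): GRPO-style advantages for one group of samples.

    Each reward is normalized by the mean and standard deviation of the
    rewards in its own group, so no learned value function is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical token counts for two questions and a hypothetical threshold.
    print(label_thinking_mode([12, 18, 15], threshold=100))
    print(label_thinking_mode([240, 310, 275], threshold=100))
    # Hypothetical binary correctness rewards for four sampled responses.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

In a full pipeline the mode label would presumably also shape the prompt or the reward during GRPO training; how that is done is not specified in the abstract, so it is left out here.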
Related papers
- VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice [88.93674345138054]
Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. We propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy.
arXiv Detail & Related papers (2026-01-08T18:00:59Z)
- The Virtues of Brevity: Avoid Overthinking in Parallel Test-Time Reasoning [0.7874708385247352]
We show that the simple and counterintuitive heuristic of selecting the shortest solution is highly effective. We confirm that this approach is competitive with complex methods such as self-consistency.
arXiv Detail & Related papers (2025-10-24T00:47:17Z)
- Fast Thinking for Large Language Models [67.7238685892317]
We introduce Latent Codebooks for Fast Thinking, a framework that uses concise CoT sketches only during training to learn a codebook of discrete strategy priors. At inference, the model conditions on a handful of continuous thinking switches distilled from the codebook in a single pass, enabling strategy-level guidance without producing explicit reasoning tokens.
arXiv Detail & Related papers (2025-09-28T04:19:48Z)
- Controlling Thinking Speed in Reasoning Models [57.14541748751654]
Human cognition operates in two modes: fast, intuitive System 1 thinking and slow, deliberate System 2 thinking. In this work, we enable LRMs to approximate human intelligence through dynamic thinking speed adjustment. Our approach addresses two key questions: (1) how to control thinking speed in LRMs, and (2) when to adjust it for optimal performance.
arXiv Detail & Related papers (2025-07-04T16:41:06Z)
- DynamicMind: A Tri-Mode Thinking System for Large Language Models [28.327075192324234]
DynamicMind is a novel tri-mode thinking system for large language models. It autonomously selects between Fast, Normal, and Slow thinking modes for zero-shot question answering (ZSQA) tasks. It achieves superior ZSQA capabilities while establishing an effective trade-off between performance and computational efficiency.
arXiv Detail & Related papers (2025-06-06T10:02:13Z)
- The Price of a Second Thought: On the Evaluation of Reasoning Efficiency in Large Language Models [54.88805865447848]
We show that instruct models achieve higher efficiency overall, and problem difficulty affects efficiency. We propose COTHINK, a simple two-stage pipeline: an instruct model drafts a brief outline, and a thinking model expands it. On GSM8K, MATH500, and AIME24, COTHINK cuts token usage by 21.1% while keeping accuracy on four thinking models, and remains competitive with strong efficiency baselines.
arXiv Detail & Related papers (2025-05-28T06:24:45Z)
- Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models [67.87579664988199]
TON is a two-stage training strategy for vision-language models (VLMs). It introduces a think-or-not format that serves as a cold start for selective reasoning. TON can reduce the completion length by up to 90% compared to vanilla GRPO.
arXiv Detail & Related papers (2025-05-22T16:13:29Z)
- Incentivizing Dual Process Thinking for Efficient Large Language Model Reasoning [75.04643265875072]
Large reasoning models (LRMs) have demonstrated strong performance on complex reasoning tasks, but often suffer from overthinking. Inspired by the dual process theory in cognitive science, we propose Adaptive Cognition Policy Optimization (ACPO). ACPO enables LRMs to achieve efficient reasoning through adaptive cognitive allocation and dynamic system switch.
arXiv Detail & Related papers (2025-05-22T07:15:08Z)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging [17.038807261969033]
Long-to-Short (L2S) reasoning aims to balance reasoning depth with practical efficiency. Model merging offers a cost-effective and robust alternative by integrating the quick-thinking capabilities of System 1 models with the methodical reasoning of System 2 models. Our experiments reveal that model merging can reduce average response length by up to 55% while preserving or even improving baseline performance.
arXiv Detail & Related papers (2025-03-26T15:34:37Z)
- SDRT: Enhance Vision-Language Models by Self-Distillation with Diverse Reasoning Traces [11.462550020102935]
We propose a novel self-distillation framework for Vision-Language Models. We employ a prompt library tailored to visual reasoning tasks to generate diverse in-context questions. We then utilize a two-step reasoning procedure to derive reasoning-guided responses. These responses are then used for self-distillation, enabling the model to internalize the reasoning process.
arXiv Detail & Related papers (2025-03-03T17:24:42Z)
- DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models [42.95876831743256]
Large language models (LLMs) have demonstrated emergent capabilities across diverse reasoning tasks via Chains-of-Thought prompting.
This paper addresses the challenge of enabling LLMs to autonomously select between fast and slow inference methods.
We introduce a dynamic decision-making framework that categorizes tasks into two distinct pathways: 'Fast', designated for tasks where the LLM quickly identifies a high-confidence solution, and 'Slow', allocated for tasks that the LLM perceives as complex.
arXiv Detail & Related papers (2024-07-01T06:45:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.