R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
- URL: http://arxiv.org/abs/2508.21113v2
- Date: Tue, 02 Sep 2025 13:37:38 GMT
- Title: R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
- Authors: Qi Yang, Bolin Ni, Shiming Xiang, Han Hu, Houwen Peng, Jie Jiang,
- Abstract summary: Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks.<n>It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models.
- Score: 38.74501281986792
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization (BPO) to improve the model's accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. Then it undergoes a second phase of training under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.
Related papers
- Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs [102.48588475875749]
We introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework.<n>GSR generates a set of candidate responses in parallel and then performs self-refinement to synthesize a new superior solution.<n>We show that our method achieves state-of-the-art performance across five mathematical benchmarks.
arXiv Detail & Related papers (2025-08-27T06:51:48Z) - Coherent Multimodal Reasoning with Iterative Self-Evaluation for Vision-Language Models [4.064135211977999]
Large language models (LLMs) and vision-language models (LVLMs) struggle with complex, multi-step, cross-modal common sense reasoning tasks.<n>We propose the Coherent Multimodal Reasoning Framework (CMRF), a novel approach that enhances LVLMs' common sense reasoning capabilities.<n>CMRF mimics human problem-solving by decomposing complex queries, generating step-by-step inferences, and self-correcting errors.
arXiv Detail & Related papers (2025-08-04T20:33:58Z) - KAT-V1: Kwai-AutoThink Technical Report [50.84483585850113]
We present Kwaipilot-AutoThink (KAT), an open-source 40B large language model developed to address the overthinking problem in reasoning-intensive tasks.<n>KAT dynamically switches between reasoning and non-reasoning modes based on task complexity.<n>We also propose Step-SRPO, a reinforcement learning algorithm that incorporates intermediate supervision into the GRPO framework.
arXiv Detail & Related papers (2025-07-11T04:07:10Z) - SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning [45.28220409043598]
Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, but struggle with complex problems requiring explicit self-reflection and self-correction.<n>Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback.<n>We propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning framework.
arXiv Detail & Related papers (2025-06-02T14:21:44Z) - Think Only When You Need with Large Hybrid-Reasoning Models [121.55211364358662]
Large Hybrid-Reasoning Models (LHRMs)<n>First kind of model capable of adaptively determining whether to perform thinking based on contextual information of user queries.<n>Extensive experimental results show that LHRMs can adaptively perform hybrid thinking on queries of varying difficulty and type.
arXiv Detail & Related papers (2025-05-20T17:23:25Z) - Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL [19.731871225975926]
Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers.<n>To address this over-thinking problem, we explore how to equip LRMs with adaptive thinking capabilities.<n>We propose AutoThink, a multi-stage reinforcement learning framework that progressively optimize reasoning policies.
arXiv Detail & Related papers (2025-05-16T04:01:57Z) - Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging [17.038807261969033]
Long-to-Short (L2S) reasoning aims to balance reasoning depth with practical efficiency.<n>Model merging offers a cost-effective and robust alternative by integrating the quick-thinking capabilities of System 1 models with the methodical reasoning of System 2 models.<n>Our experiments reveal that model merging can reduce average response length by up to 55% while preserving or even improving baseline performance.
arXiv Detail & Related papers (2025-03-26T15:34:37Z) - BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning [78.63421517563056]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks.<n>We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model.<n>We introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps.
arXiv Detail & Related papers (2025-01-31T02:39:07Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.