Related papers: R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning

R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning

URL: http://arxiv.org/abs/2508.21113v2
Date: Tue, 02 Sep 2025 13:37:38 GMT
Title: R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
Authors: Qi Yang, Bolin Ni, Shiming Xiang, Han Hu, Houwen Peng, Jie Jiang,
Abstract summary: Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks.<n>It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models.
Score: 38.74501281986792
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization (BPO) to improve the model's accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. Then it undergoes a second phase of training under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.

Related papers

Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs [102.48588475875749]
We introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework.<n>GSR generates a set of candidate responses in parallel and then performs self-refinement to synthesize a new superior solution.<n>We show that our method achieves state-of-the-art performance across five mathematical benchmarks.
arXiv Detail & Related papers (2025-08-27T06:51:48Z)
Coherent Multimodal Reasoning with Iterative Self-Evaluation for Vision-Language Models [4.064135211977999]
Large language models (LLMs) and vision-language models (LVLMs) struggle with complex, multi-step, cross-modal common sense reasoning tasks.<n>We propose the Coherent Multimodal Reasoning Framework (CMRF), a novel approach that enhances LVLMs' common sense reasoning capabilities.<n>CMRF mimics human problem-solving by decomposing complex queries, generating step-by-step inferences, and self-correcting errors.
arXiv Detail & Related papers (2025-08-04T20:33:58Z)
KAT-V1: Kwai-AutoThink Technical Report [50.84483585850113]
We present Kwaipilot-AutoThink (KAT), an open-source 40B large language model developed to address the overthinking problem in reasoning-intensive tasks.<n>KAT dynamically switches between reasoning and non-reasoning modes based on task complexity.<n>We also propose Step-SRPO, a reinforcement learning algorithm that incorporates intermediate supervision into the GRPO framework.
arXiv Detail & Related papers (2025-07-11T04:07:10Z)
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning [45.28220409043598]
Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, but struggle with complex problems requiring explicit self-reflection and self-correction.<n>Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback.<n>We propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning framework.
arXiv Detail & Related papers (2025-06-02T14:21:44Z)
Think Only When You Need with Large Hybrid-Reasoning Models [121.55211364358662]
Large Hybrid-Reasoning Models (LHRMs)<n>First kind of model capable of adaptively determining whether to perform thinking based on contextual information of user queries.<n>Extensive experimental results show that LHRMs can adaptively perform hybrid thinking on queries of varying difficulty and type.
arXiv Detail & Related papers (2025-05-20T17:23:25Z)
Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL [19.731871225975926]
Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers.<n>To address this over-thinking problem, we explore how to equip LRMs with adaptive thinking capabilities.<n>We propose AutoThink, a multi-stage reinforcement learning framework that progressively optimize reasoning policies.
arXiv Detail & Related papers (2025-05-16T04:01:57Z)
Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging [17.038807261969033]
Long-to-Short (L2S) reasoning aims to balance reasoning depth with practical efficiency.<n>Model merging offers a cost-effective and robust alternative by integrating the quick-thinking capabilities of System 1 models with the methodical reasoning of System 2 models.<n>Our experiments reveal that model merging can reduce average response length by up to 55% while preserving or even improving baseline performance.
arXiv Detail & Related papers (2025-03-26T15:34:37Z)
BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning [78.63421517563056]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks.<n>We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model.<n>We introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps.
arXiv Detail & Related papers (2025-01-31T02:39:07Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.