Related papers: Think Twice: Branch-and-Rethink Reasoning Reward Model

Think Twice: Branch-and-Rethink Reasoning Reward Model

URL: http://arxiv.org/abs/2510.23596v1
Date: Mon, 27 Oct 2025 17:58:07 GMT
Title: Think Twice: Branch-and-Rethink Reasoning Reward Model
Authors: Yizhu Jiao, Jiaqi Zeng, Julien Veron Vialard, Oleksii Kuchaiev, Jiawei Han, Olivier Delalleau,
Abstract summary: We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling.<n>We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks.<n>By converting all-at-oncescoringintofocused, second-lookreasoning, BR-RMreducesjudgmentdiffusionandimproves sensitivity to subtle yet consequential errors.
Score: 32.70732791642558
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-oncescoringintofocused, second-lookreasoning, BR-RMreducesjudgmentdiffusionandimproves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains. The code and the model will be released soon.

Related papers

Recursive Think-Answer Process for LLMs and VLMs [54.52289112197118]
We propose an efficient Recursive Think-Answer Process (R-TAP)<n>R-TAP enables models to engage in iterative reasoning cycles and generate more accurate answers.<n>We show that R-TAP-enhanced models consistently outperform conventional single-pass methods.
arXiv Detail & Related papers (2026-03-02T17:20:10Z)
CODE: A Contradiction-Based Deliberation Extension Framework for Overthinking Attacks on Retrieval-Augmented Generation [43.85448261466922]
We propose an end-to-end attack framework named Contradiction-Based Deliberation Extension (CODE)<n> CODE develops a multi-agent architecture to construct poisoning samples that are injected into the knowledge base.<n>Experiments show that CODE causes a 5.32x-24.72x increase in reasoning token consumption, without degrading task performance.
arXiv Detail & Related papers (2026-01-19T14:52:31Z)
Adversarial Yet Cooperative: Multi-Perspective Reasoning in Retrieved-Augmented Language Models [72.4149653187766]
We propose a Reasoner-Verifier framework named Adrialversa Reasoning RAG (ARR)<n>The Reasoner and Verifier engage in reasoning on retrieved evidence and critiquing each other's logic while being guided by process-aware advantage.<n> Experiments on multiple benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2026-01-08T06:57:03Z)
Dual-Weighted Reinforcement Learning for Generative Preference Modeling [61.443461640955796]
We propose Dual-Weighted Reinforcement Learning (DWRL) as a new framework for preference modeling.<n>In this paper, we apply DWRL to preference modeling by training generative preference models (GPMs) to first generate a thought and then predict the human preference score.<n>Our results position DWRL as a general framework for reasoning-enhanced preference learning beyond verifiable tasks.
arXiv Detail & Related papers (2025-10-17T02:14:24Z)
Answer-Consistent Chain-of-thought Reinforcement Learning For Multi-modal Large Langauge Models [33.398631680508814]
We propose Answer-Consistent Reinforcement Learning that modifies the GRPO algorithm with an auxiliary consistency check.<n>We design a consistency-verification reward that grants a high reward only if both the original and the post-shuffle answers agree and are correct.<n>We evaluate ACRE on challenging Video Reasoning benchmarks and multimodal math reasoning benchmarks, achieving an average 2.2% and 1.5% improvement.
arXiv Detail & Related papers (2025-10-11T08:32:52Z)
Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection [85.0189917888094]
We propose a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework to handle the challenges posed by subtle and infrequent mistakes.<n>The proposed method achieves strong performance, particularly in identifying rare and ambiguous mistake instances.
arXiv Detail & Related papers (2025-09-16T12:00:42Z)
Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models [28.756240721942138]
Reasoning large language models (RLLMs) have recently demonstrated remarkable capabilities through structured and multi-step reasoning.<n>We propose Thinking with Nothinking (JointThinking), a new ICL paradigm that prompts the model to generate two answers in parallel.<n>JointThinking significantly outperforms few-shot chain-of-thought (CoT), thinking twice and majority voting.
arXiv Detail & Related papers (2025-08-05T12:09:55Z)
Reinforcing Video Reasoning with Focused Thinking [65.85683941058916]
We propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity.<n>Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density.<n>We also reformulate RL training by shifting from single-choice to multi-choice QA tasks.
arXiv Detail & Related papers (2025-05-30T15:42:19Z)
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning [31.727984223052648]
This paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model.<n>We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o.<n>We then prepare large-scale unified multimodal preference data to elicit the model's reasoning process across various vision tasks.
arXiv Detail & Related papers (2025-05-06T08:46:41Z)
RM-R1: Reward Modeling as Reasoning [81.50471199906738]
Reasoning Reward Models (ReasRMs) formulate reward modeling as a reasoning task.<n>We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1.<n>Our models achieve state-of-the-art performance across three reward model benchmarks on average.
arXiv Detail & Related papers (2025-05-05T06:11:12Z)
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [70.77691645678804]
We present the first successful replication of emergent characteristics for multimodal reasoning on only a non-SFT 2B model.<n>Our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding both SFT setting by 2%.<n>In addition, we share our failed attempts and insights in attempting to achieve R1-like reasoning using RL with instruct models.
arXiv Detail & Related papers (2025-03-07T04:21:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.