The Good, The Bad, and The Hybrid: A Reward Structure Showdown in Reasoning Models Training
- URL: http://arxiv.org/abs/2511.13016v1
- Date: Mon, 17 Nov 2025 06:15:26 GMT
- Title: The Good, The Bad, and The Hybrid: A Reward Structure Showdown in Reasoning Models Training
- Authors: Subramanyam Sahoo
- Abstract summary: We study hard, continuous, and hybrid reward structures for fine-tuning large language models. We introduce an adaptive hybrid reward scheduler that transitions between discrete and continuous signals. Our results show that hybrid reward structures improve convergence speed and training stability over purely hard or continuous approaches.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reward design is central to reinforcement learning from human feedback (RLHF) and alignment research. In this work, we propose a unified framework to study hard, continuous, and hybrid reward structures for fine-tuning large language models (LLMs) on mathematical reasoning tasks. Using Qwen3-4B with LoRA fine-tuning on the GSM8K dataset, we formalize and empirically evaluate reward formulations that incorporate correctness, perplexity, reasoning quality, and consistency. We introduce an adaptive hybrid reward scheduler that transitions between discrete and continuous signals, balancing exploration and stability. Our results show that hybrid reward structures improve convergence speed and training stability over purely hard or continuous approaches, offering insights for alignment via adaptive reward modeling.
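The abstract does not include an implementation, but the scheduler idea it describes can be sketched: a discrete correctness reward is blended with a continuous, perplexity-derived reward, and the blend shifts as training progresses. The Python sketch below is a minimal illustration under assumed names (`hard_reward`, `soft_reward`, `HybridRewardScheduler`) and an assumed linear schedule; the paper's actual formulation, which also incorporates reasoning-quality and consistency terms, may differ.

```python
# Illustrative sketch of a hybrid reward with an adaptive schedule.
# All names and the linear blend rule are assumptions for illustration,
# not the paper's implementation.
import math


def hard_reward(answer: str, gold: str) -> float:
    """Discrete correctness signal: 1.0 if the final answer matches the gold answer, else 0.0."""
    return 1.0 if answer.strip() == gold.strip() else 0.0


def soft_reward(log_prob_per_token: float, max_ppl: float = 100.0) -> float:
    """Continuous signal in [0, 1] derived from average per-token log-likelihood
    (lower perplexity -> higher reward). The normalization constant is an assumption."""
    ppl = math.exp(-log_prob_per_token)
    return max(0.0, 1.0 - min(ppl, max_ppl) / max_ppl)


class HybridRewardScheduler:
    """Blend discrete and continuous rewards, shifting weight toward the
    discrete correctness signal as training progresses (simple linear schedule)."""

    def __init__(self, total_steps: int, start_alpha: float = 0.2, end_alpha: float = 0.9):
        self.total_steps = total_steps
        self.start_alpha = start_alpha  # initial weight on the hard reward
        self.end_alpha = end_alpha      # final weight on the hard reward

    def alpha(self, step: int) -> float:
        frac = min(step / self.total_steps, 1.0)
        return self.start_alpha + frac * (self.end_alpha - self.start_alpha)

    def reward(self, step: int, answer: str, gold: str, log_prob_per_token: float) -> float:
        a = self.alpha(step)
        return a * hard_reward(answer, gold) + (1.0 - a) * soft_reward(log_prob_per_token)


# Example usage
scheduler = HybridRewardScheduler(total_steps=1000)
r = scheduler.reward(step=250, answer="42", gold="42", log_prob_per_token=-1.2)
print(f"hybrid reward at step 250: {r:.3f}")
```

A schedule of this shape keeps early training driven mostly by the dense continuous signal (exploration) and later training by the sparse correctness signal (stability), which is the trade-off the abstract attributes to the hybrid scheme.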
Related papers
- Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling [49.41422138354821]
We propose a principled reward modeling framework that integrates non-negative factor analysis into the Bradley-Terry preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process. We show that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.
arXiv Detail & Related papers (2026-02-11T08:14:11Z)
- Reward Modeling for Reinforcement Learning-Based LLM Reasoning: Design, Challenges, and Evaluation [46.38008143057758]
Large Language Models (LLMs) demonstrate transformative potential, yet their reasoning remains inconsistent and unreliable. This work argues that reward modeling is not merely an implementation detail but a central architect of reasoning alignment. Within this framework, we present a taxonomy of reward mechanisms, analyze reward hacking as a pervasive failure mode, and examine how reward signals unify challenges.
arXiv Detail & Related papers (2026-02-10T00:45:24Z)
- Round-trip Reinforcement Learning: Self-Consistent Training for Better Chemical LLMs [51.29260537017623]
Large Language Models (LLMs) are emerging as versatile foundation models for computational chemistry. These models often lack round-trip consistency. We introduce Round-Trip Reinforcement Learning (RTRL), a novel framework that trains a model to improve its consistency.
arXiv Detail & Related papers (2025-10-01T23:58:58Z)
- Balanced Actor Initialization: Stable RLHF Training of Distillation-Based Reasoning Models [27.0496567592082]
The development of alignment and reasoning capabilities in large language models has seen remarkable progress. The third paradigm of applying RLHF to distillation-trained models presents significant challenges. We propose Balanced Actor Initialization (BAI), a two-stage weighted model merging approach.
arXiv Detail & Related papers (2025-08-30T01:53:25Z)
- Intra-Trajectory Consistency for Reward Modeling [67.84522106537274]
We develop an intra-trajectory consistency regularization to enforce that adjacent processes with higher next-token generation probability maintain more consistent rewards. We show that the reward model trained with the proposed regularization induces better DPO-aligned policies and achieves better best-of-N (BoN) inference-time verification results.
arXiv Detail & Related papers (2025-06-10T12:59:14Z)
- Reward-Robust RLHF in LLMs [25.31456438114974]
Large Language Models (LLMs) continue to progress toward more advanced forms of intelligence.
The reliance on reward-model-based (RM-based) alignment methods introduces significant challenges.
We introduce a reward-robust RLHF framework aimed at addressing these fundamental challenges.
arXiv Detail & Related papers (2024-09-18T02:35:41Z)
- HAF-RM: A Hybrid Alignment Framework for Reward Model Training [51.59246299566669]
We propose a hybrid alignment framework, HaF-RM, for reward model training. It offers a principled and effective approach to enhancing the performance and alignment of reward models.
arXiv Detail & Related papers (2024-07-04T23:26:56Z)
- DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging [65.41765072566287]
We propose the Domain knowledge merged Reward Model (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model by model merging.
arXiv Detail & Related papers (2024-07-01T17:01:54Z)
- Conformer-based Hybrid ASR System for Switchboard Dataset [99.88988282353206]
We present and evaluate a competitive conformer-based hybrid model training recipe.
We study different training aspects and methods to improve word-error-rate as well as to increase training speed.
We conduct experiments on Switchboard 300h dataset and our conformer-based hybrid model achieves competitive results.
arXiv Detail & Related papers (2021-11-05T12:03:18Z)