Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards
- URL: http://arxiv.org/abs/2510.23083v1
- Date: Mon, 27 Oct 2025 07:36:41 GMT
- Title: Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards
- Authors: Jan Niklas Groeneveld, Xi Qin, Alexander Schaefer, Yaad Oren
- Abstract summary: This paper investigates whether state-of-the-art small language models can be turned into usable reward models. We construct a dataset of code samples with correctness labels derived from the APPS coding challenge benchmark. Using this critic, we achieve a more than 20% improvement in identifying the most accurate code among multiple generations.
- Score: 40.23960862004138
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Generating high-quality code remains a challenge for Large Language Models (LLMs). Reward models, which judge final outcomes or intermediate steps, are a necessary intermediate step in evolving reasoning models for this task. A decoder-only transformer can be turned into a reward model by adding a regression layer and applying supervised fine-tuning. While reflection capabilities are known to generally increase with model size, we investigate whether state-of-the-art small language models such as the Phi-4 family can be turned into usable reward models that blend process rewards and outcome rewards. To this end, we construct a dataset of code samples with correctness labels derived from the APPS coding challenge benchmark. We then train a value-head model to estimate the success probability of intermediate outputs. Our evaluation shows that small LLMs can serve as effective reward models, or code evaluation critics, successfully identifying correct solutions among multiple candidates. Using this critic, we achieve a more than 20% improvement in identifying the most accurate code among multiple generations.
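The value-head construction the abstract describes (a regression layer on top of a decoder-only transformer that outputs a success probability, used to pick the best of several generations) can be sketched as follows. This is a minimal illustration with random weights and stand-in hidden states, not the paper's implementation; the names `ValueHead`, `best_of_n`, and `HIDDEN` are hypothetical.

```python
import math
import random

random.seed(0)

HIDDEN = 16  # stand-in for the decoder's hidden size


class ValueHead:
    """A minimal value head: one linear layer plus a sigmoid.

    In the paper's setup this sits on top of a decoder-only model
    (e.g. Phi-4) and is fine-tuned on correctness labels; here the
    weights are random, for illustration only.
    """

    def __init__(self, hidden_size):
        self.w = [random.gauss(0.0, 0.1) for _ in range(hidden_size)]
        self.b = 0.0

    def success_probability(self, hidden_state):
        # Linear projection of the last token's hidden state -> scalar logit,
        # squashed to a probability of the sample being correct.
        logit = sum(w * h for w, h in zip(self.w, hidden_state)) + self.b
        return 1.0 / (1.0 + math.exp(-logit))


def best_of_n(candidates, hidden_states, head):
    """Best-of-n search: keep the sample the critic scores highest."""
    scores = [head.success_probability(h) for h in hidden_states]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_idx], scores


head = ValueHead(HIDDEN)
candidates = ["solution_a", "solution_b", "solution_c"]
hidden_states = [[random.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in candidates]
best, scores = best_of_n(candidates, hidden_states, head)
```

Because the head scores each candidate independently, the same critic can rank either finished programs (outcome reward) or partial generations (process reward) without any change to the selection loop.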
Related papers
- BaNEL: Exploration Posteriors for Generative Modeling Using Only Negative Rewards [25.999630323726464]
BaNEL is an algorithm that post-trains the model using failed attempts only, while minimizing the number of reward evaluations (NREs). We show that BaNEL can improve model performance without observing a single successful sample on several sparse-reward tasks.
arXiv Detail & Related papers (2025-10-10T17:55:03Z)
- Activation Reward Models for Few-Shot Model Alignment [77.37511364793515]
We introduce Activation Reward Models (Activation RMs). Activation RMs leverage activation steering to construct well-aligned reward signals using minimal supervision and no additional model finetuning. We demonstrate the effectiveness of Activation RMs in mitigating reward hacking behaviors, highlighting their utility for safety-critical applications.
arXiv Detail & Related papers (2025-07-02T05:10:29Z)
- GRAM: A Generative Foundation Reward Model for Reward Generalization [48.63394690265176]
We develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning. This model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning.
arXiv Detail & Related papers (2025-06-17T04:34:27Z)
- Self-Correcting Code Generation Using Small Language Models [20.68323406228016]
Self-correction has demonstrated potential in code generation by allowing language models to revise and improve their outputs through successive refinement. We introduce CoCoS, an approach designed to enhance the ability of small language models for multi-turn code correction. With 1B-scale models, CoCoS achieves improvements of 35.8% on MBPP and 27.7% on HumanEval compared to the baselines.
arXiv Detail & Related papers (2025-05-29T04:04:44Z)
- Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference [27.205035058481553]
We propose assigning scores to every sentence, introducing an intermediate-grained reward model. A novel attention mechanism is introduced to aggregate the scores of all sentences into a response-level score. Our method outperforms the response-level reward model by 2.7% on RewardBench.
arXiv Detail & Related papers (2025-03-01T14:11:04Z)
- Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [54.4392552373835]
Reward models (RMs) are crucial for the training and inference-time scaling of large language models (LLMs). We propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals to provide reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference-time best-of-n searches on real-world downstream tasks.
arXiv Detail & Related papers (2025-02-26T17:19:12Z)
- RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution [50.171320156632866]
Reinforcement learning from human feedback offers a promising approach to aligning large language models with human preferences. Current reward models operate as sequence-to-one models, allocating a single, sparse, and delayed reward to an entire output sequence. We propose a more fine-grained, token-level guidance approach for RL training.
arXiv Detail & Related papers (2024-11-13T02:45:21Z)
- RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
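The RewardBench entry above evaluates on prompt-chosen-rejected trios, which suggests a simple evaluation loop: a reward model is scored by how often it ranks the chosen response above the rejected one. A minimal sketch, using a deliberately naive stand-in scorer; `pairwise_accuracy`, `toy_score`, and the toy trios are hypothetical, not RewardBench's actual API or data.

```python
def pairwise_accuracy(trios, score):
    """Fraction of (prompt, chosen, rejected) trios where the reward
    model `score(prompt, response) -> float` prefers the chosen response."""
    correct = sum(
        1
        for prompt, chosen, rejected in trios
        if score(prompt, chosen) > score(prompt, rejected)
    )
    return correct / len(trios)


# Toy stand-in reward model: longer responses score higher (illustration only).
def toy_score(prompt, response):
    return float(len(response))


trios = [
    ("p1", "a detailed answer", "short"),
    ("p2", "ok", "a rambling wrong answer"),
    ("p3", "thorough reply", "no"),
]
acc = pairwise_accuracy(trios, toy_score)  # 2 of 3 trios ranked correctly
```

The same loop works for any scorer, including a value-head critic like the one in the main paper, which is what makes a shared leaderboard across training methods possible.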
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.