Small Reward Models via Backward Inference
- URL: http://arxiv.org/abs/2602.13551v1
- Date: Sat, 14 Feb 2026 01:55:39 GMT
- Title: Small Reward Models via Backward Inference
- Authors: Yike Wang, Faeze Brahman, Shangbin Feng, Teng Xiao, Hannaneh Hajishirzi, Yulia Tsvetkov
- Abstract summary: FLIP (FLipped Inference for Prompt reconstruction) is a reference-free and rubric-free reward modeling approach. It reformulates reward modeling through backward inference: inferring the instruction that would most plausibly produce a given response.
- Score: 100.59075794599768
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reward models (RMs) play a central role throughout the language model (LM) pipeline, particularly in non-verifiable domains. However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility. In this work, we propose FLIP (FLipped Inference for Prompt reconstruction), a reference-free and rubric-free reward modeling approach that reformulates reward modeling through backward inference: inferring the instruction that would most plausibly produce a given response. The similarity between the inferred and the original instructions is then used as the reward signal. Evaluations across four domains using 13 small language models show that FLIP outperforms LLM-as-a-Judge baselines by an average of 79.6%. Moreover, FLIP substantially improves downstream performance in extrinsic evaluations under test-time scaling via parallel sampling and GRPO training. We further find that FLIP is particularly effective for longer outputs and robust to common forms of reward hacking. By explicitly exploiting the validation-generation gap, FLIP enables reliable reward modeling in downscaled regimes where judgment methods fail. Code available at https://github.com/yikee/FLIP.
Related papers
- Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling [49.41422138354821]
We propose a principled reward modeling framework that integrates non-negative factor analysis into the Bradley-Terry preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process. We show that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.
arXiv Detail & Related papers (2026-02-11T08:14:11Z)
- Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry [41.26991813225211]
We investigate whether smaller models can serve as efficient evaluators by leveraging internal representations instead of surface generation. We propose the Semantic Capacity Asymmetry Hypothesis: evaluation requires significantly less semantic capacity than generation. We instantiate this paradigm through INSPECTOR, a probing-based framework that predicts aspect-level evaluation scores from small model representations.
arXiv Detail & Related papers (2026-01-30T05:34:24Z)
- GDRO: Group-level Reward Post-training Suitable for Diffusion Models [55.948229011478304]
Group-level rewards successfully align the model with the targeted reward. Group-level Direct Reward Optimization (GDRO) is a new post-training paradigm for group-level reward alignment. GDRO supports fully offline training, which avoids the large time cost of image rollout sampling. It is diffusion-sampler-independent, which eliminates the need for the ODE-to-SDE approximation to obtain stochasticity.
arXiv Detail & Related papers (2026-01-05T11:47:18Z)
- RFG: Test-Time Scaling for Diffusion Large Language Model Reasoning with Reward-Free Guidance [101.30279597148973]
We propose reward-free guidance (RFG) for guiding the reasoning trajectory of dLLMs without explicit process reward. RFG consistently yields significant improvements across all tasks and model types, achieving accuracy gains of up to 9.2%.
arXiv Detail & Related papers (2025-09-29T23:59:16Z)
- Better Language Model-Based Judging Reward Modeling through Scaling Comprehension Boundaries [3.930598942647121]
We propose a two-stage LM-based judging reward model that utilizes an explanation-based slot framework for prediction. In both reinforcement learning from human feedback (RLHF) and out-of-distribution (OOD) scenarios, the ESFP-RM framework delivers more stable and generalizable reward signals.
arXiv Detail & Related papers (2025-08-25T17:11:28Z)
- Aligning Large Language Models via Fine-grained Supervision [20.35000061196631]
Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations.
Current approaches focus on using reinforcement learning with human feedback to improve model alignment.
We propose a method to enhance LLM alignment through fine-grained token-level supervision.
arXiv Detail & Related papers (2024-06-04T20:21:45Z)
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
- RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.