BaNEL: Exploration Posteriors for Generative Modeling Using Only Negative Rewards
- URL: http://arxiv.org/abs/2510.09596v1
- Date: Fri, 10 Oct 2025 17:55:03 GMT
- Title: BaNEL: Exploration Posteriors for Generative Modeling Using Only Negative Rewards
- Authors: Sangyun Lee, Brandon Amos, Giulia Fanti
- Abstract summary: BaNEL is an algorithm that post-trains the model using failed attempts only, while minimizing the number of reward evaluations (NREs). We show that BaNEL can improve model performance without observing a single successful sample on several sparse-reward tasks.
- Score: 25.999630323726464
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Today's generative models thrive with large amounts of supervised data and informative reward functions characterizing the quality of the generation. They work under the assumptions that the supervised data provides knowledge to pre-train the model, and the reward function provides dense information about how to further improve the generation quality and correctness. However, in the hardest instances of important problems, two problems arise: (1) the base generative model attains a near-zero reward signal, and (2) calls to the reward oracle are expensive. This setting poses a fundamentally different learning challenge than standard reward-based post-training. To address this, we propose BaNEL (Bayesian Negative Evidence Learning), an algorithm that post-trains the model using failed attempts only, while minimizing the number of reward evaluations (NREs). Our method is based on the idea that the problem of learning regularities underlying failures can be cast as another, in-loop generative modeling problem. We then leverage this model to assess whether new data resembles previously seen failures and steer the generation away from them. We show that BaNEL can improve model performance without observing a single successful sample on several sparse-reward tasks, outperforming existing novelty-bonus approaches by up to several orders of magnitude in success rate, while using fewer reward evaluations.
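The abstract's core idea can be illustrated with a short sketch. The code below is an illustration of the stated idea only, not the paper's algorithm: a generative "failure model" (here a simple kernel density estimate) is fit on failed attempts, new candidates that resemble past failures are filtered out before spending any expensive reward-oracle calls, and newly observed failures are added back to the failure set. All names (`FailureModel`, `generate_candidates`, `reward_oracle`) are hypothetical, and a binary sparse reward (positive on success, non-positive on failure) is assumed.
```python
import numpy as np
from sklearn.neighbors import KernelDensity  # stand-in for an in-loop generative model


class FailureModel:
    """Generative model fit on failed attempts only (here: a simple KDE)."""

    def __init__(self, bandwidth: float = 0.5):
        self.kde = KernelDensity(bandwidth=bandwidth)

    def fit(self, failed_samples: np.ndarray) -> None:
        self.kde.fit(failed_samples)

    def log_prob(self, samples: np.ndarray) -> np.ndarray:
        # Log-density under the failure model: high values = "looks like a past failure".
        return self.kde.score_samples(samples)


def post_train_round(generate_candidates, reward_oracle, failure_model,
                     n_candidates: int = 256, novelty_quantile: float = 0.25):
    """One round: generate, keep only candidates unlike past failures,
    spend the reward oracle on those, and grow the failure set."""
    candidates = generate_candidates(n_candidates)        # (n, d) array of samples
    log_p_fail = failure_model.log_prob(candidates)

    # Keep the candidates least likely under the failure model,
    # i.e., most novel relative to previously seen failures.
    threshold = np.quantile(log_p_fail, novelty_quantile)
    novel = candidates[log_p_fail <= threshold]

    # Few, expensive reward-oracle calls, spent only on novel candidates.
    rewards = np.array([reward_oracle(x) for x in novel])
    successes = novel[rewards > 0]
    new_failures = novel[rewards <= 0]
    return successes, new_failures
```
In the paper's setting the failure model would itself be a learned generative model updated in the loop, and the generator would also be steered away from the failure distribution; this sketch only shows the filter-then-evaluate step that reduces reward-oracle usage.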
Related papers
- Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards [40.23960862004138]
This paper investigates whether state-of-the-art small language models can be turned into usable reward models. We construct a dataset of code samples with correctness labels derived from the APPS coding challenge benchmark. Using this critic, we achieve over a 20% improvement in the search capability of the most accurate code out of multiple generations.
arXiv Detail & Related papers (2025-10-27T07:36:41Z) - GRAM: A Generative Foundation Reward Model for Reward Generalization [48.63394690265176]
We develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning. This model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning.
arXiv Detail & Related papers (2025-06-17T04:34:27Z) - Intention-Conditioned Flow Occupancy Models [80.42634994902858]
Large-scale pre-training has fundamentally changed how machine learning research is done today. Applying this same framework to reinforcement learning is appealing because it offers compelling avenues for addressing core challenges in RL. Recent advances in generative AI have provided new tools for modeling highly complex distributions.
arXiv Detail & Related papers (2025-06-10T15:27:46Z) - RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution [50.171320156632866]
Reinforcement learning from human feedback offers a promising approach to aligning large language models with human preferences. Current reward models operate as sequence-to-one models, allocating a single, sparse, and delayed reward to an entire output sequence. We propose a more fine-grained, token-level guidance approach for RL training.
arXiv Detail & Related papers (2024-11-13T02:45:21Z) - The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret [64.04721528586747]
We show that a sufficiently low expected test error of the reward model guarantees low worst-case regret. We then show that similar problems persist even when using policy regularization techniques.
arXiv Detail & Related papers (2024-06-22T06:43:51Z) - RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z) - RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment [32.752633250862694]
Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data.
We introduce a new framework, Reward rAnked FineTuning, designed to align generative models effectively.
arXiv Detail & Related papers (2023-04-13T18:22:40Z) - RewardsOfSum: Exploring Reinforcement Learning Rewards for Summarisation [7.0471949371778795]
We propose two reward functions for the task of abstractive summarisation.
The first function, referred to as RwB-Hinge, dynamically selects the samples for the gradient update.
The second function, nicknamed RISK, leverages a small pool of strong candidates to inform the reward.
arXiv Detail & Related papers (2021-06-08T03:30:50Z) - Exposing Shallow Heuristics of Relation Extraction Models with Challenge Data [49.378860065474875]
We identify failure modes of SOTA relation extraction (RE) models trained on TACRED.
By adding some of the challenge data as training examples, the performance of the model improves.
arXiv Detail & Related papers (2020-10-07T21:17:25Z) - An Efficient Method of Training Small Models for Regression Problems with Knowledge Distillation [1.433758865948252]
We propose a new formalism of knowledge distillation for regression problems.
First, we propose a new loss function, teacher outlier loss rejection, which rejects outliers in training samples using teacher model predictions.
By considering the multi-task network, the feature extraction of student models is trained more effectively.
arXiv Detail & Related papers (2020-02-28T08:46:12Z)