Fugu-MT Paper Translation (Abstract): BaNEL: Exploration Posteriors for Generative Modeling Using Only Negative Rewards

Paper summary: BaNEL: Exploration Posteriors for Generative Modeling Using Only Negative Rewards

arxiv url: http://arxiv.org/abs/2510.09596v1
Date: Fri, 10 Oct 2025 17:55:03 GMT
Status: Translation complete
Last updated in system: 2025-10-14 00:38:49.51083
Title: BaNEL: Exploration Posteriors for Generative Modeling Using Only Negative Rewards
Authors: Sangyun Lee, Brandon Amos, Giulia Fanti
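Abstract summary: BaNEL is an algorithm that post-trains a model using only failed attempts while minimizing the number of reward evaluations (NREs). The authors show that BaNEL can improve model performance on several sparse-reward tasks without observing a single successful sample.
Reference score (independently computed attention metric): 25.999630323726464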
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Today's generative models thrive with large amounts of supervised data and informative reward functions characterizing the quality of the generation. They work under the assumptions that the supervised data provides knowledge to pre-train the model, and the reward function provides dense information about how to further improve the generation quality and correctness. However, in the hardest instances of important problems, two problems arise: (1) the base generative model attains a near-zero reward signal, and (2) calls to the reward oracle are expensive. This setting poses a fundamentally different learning challenge than standard reward-based post-training. To address this, we propose BaNEL (Bayesian Negative Evidence Learning), an algorithm that post-trains the model using failed attempts only, while minimizing the number of reward evaluations (NREs). Our method is based on the idea that the problem of learning regularities underlying failures can be cast as another, in-loop generative modeling problem. We then leverage this model to assess whether new data resembles previously seen failures and steer the generation away from them. We show that BaNEL can improve model performance without observing a single successful sample on several sparse-reward tasks, outperforming existing novelty-bonus approaches by up to several orders of magnitude in success rate, while using fewer reward evaluations.
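
The idea of steering generation away from previously seen failures can be illustrated with a small tabular toy. This is not the paper's algorithm: BaNEL learns a generative model of failures, whereas the sketch below simply counts failures per outcome. The names `posterior_success` and `steer`, the Beta(1, 1) prior, and the four-outcome base distribution are all assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction


def posterior_success(failure_count):
    """Posterior mean success probability of one outcome.

    Assumes a Beta(1, 1) prior on the outcome's success probability.
    After k failures and no successes the posterior is Beta(1, 1 + k),
    whose mean is 1 / (k + 2).
    """
    return Fraction(1, failure_count + 2)


def steer(base, failures):
    """Reweight the base distribution away from outcomes that failed often."""
    counts = Counter(failures)
    weights = {x: p * posterior_success(counts[x]) for x, p in base.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


# Hypothetical base model over four outcomes; "d" is rarely sampled.
base = {
    "a": Fraction(4, 10),
    "b": Fraction(3, 10),
    "c": Fraction(2, 10),
    "d": Fraction(1, 10),
}

# 18 failed attempts and no successes: only negative evidence is available.
failures = ["a"] * 8 + ["b"] * 6 + ["c"] * 4

steered = steer(base, failures)
for outcome, prob in steered.items():
    print(outcome, float(prob))
```

With these numbers the untried outcome "d" rises from probability 1/10 to 60/193, while the most frequently failed outcome "a" drops from 4/10 to 48/193, even though no success was ever observed. A count table cannot generalize to unseen but similar samples, which is the gap a learned failure model is meant to fill.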


Related papers