Fugu-MT Paper Translation (Abstract): Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning

Paper overview: Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning

arxiv url: http://arxiv.org/abs/2510.15514v2
Date: Tue, 21 Oct 2025 03:35:55 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:11.621311
Title: Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning
Authors: Boyin Liu, Zhuo Zhang, Sen Huang, Lipeng Xie, Qingxu Fu, Haoran Chen, LI YU, Tianyi Hu, Zhaoyang Liu, Bolin Ding, Dongbin Zhao
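Abstract summary: This work introduces an end-to-end framework for detecting and resolving judgment inconsistencies within the reinforcement learning training loop. The framework makes two core contributions: the Conflict Detection Rate (CDR), a metric for judgment conflicts, and Deconflicted Graph Rewards (DGR), a signal-purification framework. Experiments confirm that the framework significantly improves training stability and model performance over strong baselines.
Reference score (attention score computed by this site): 46.661195064495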
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Aligning language models using LLM judge feedback offers a scalable alternative to human annotation, yet is plagued by judgment inconsistencies that destabilize reinforcement learning. While prior work has focused on judge accuracy, the critical issue of logical coherence, particularly preference cycles, has been largely unaddressed. To address this gap, this work introduces an end-to-end framework to systematically detect and resolve these inconsistencies within the reinforcement learning training loop. Our framework features two core contributions: the Conflict Detection Rate (CDR), a novel metric to quantify judgment conflicts, and Deconflicted Graph Rewards (DGR), a signal-purification framework that eliminates cycles before policy optimization. DGR constructs preference graphs from raw judgments, transforms them into conflict-free Directed Acyclic Graphs (DAGs), and generates a logically coherent reward signal compatible with any policy optimizer. Experiments confirm that our framework significantly improves training stability and model performance over strong baselines, establishing logical consistency as a crucial and now-addressable dimension of AI feedback. The code for our method is available at https://github.com/modelscope/RM-Gallery.
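Illustrative sketch (not the authors' implementation): the abstract describes building a preference graph from pairwise judgments, removing cycles to obtain a DAG, and deriving a reward from it. The abstract does not specify how cycles are broken or how rewards are computed, so everything below is an assumption made for illustration: the function names are hypothetical, cycles are broken by dropping edges that contradict a net-win ordering (a simple feedback-arc heuristic), and the reward is the net win count in the resulting DAG. For the actual method, see the RM-Gallery repository linked in the abstract.

```python
def build_edges(judgments):
    """Turn pairwise judgments into directed edges (winner, loser).

    judgments maps (i, j) -> True if response i is preferred over j.
    """
    return {(i, j) if i_wins else (j, i) for (i, j), i_wins in judgments.items()}


def has_cycle(n, edges):
    """Kahn's algorithm: the graph is cyclic iff some node is never freed."""
    indeg = [0] * n
    out = [[] for _ in range(n)]
    for winner, loser in edges:
        out[winner].append(loser)
        indeg[loser] += 1
    stack = [v for v in range(n) if indeg[v] == 0]
    removed = 0
    while stack:
        v = stack.pop()
        removed += 1
        for u in out[v]:
            indeg[u] -= 1
            if indeg[u] == 0:
                stack.append(u)
    return removed < n


def net_wins(n, edges):
    """Wins minus losses for each response."""
    score = [0] * n
    for winner, loser in edges:
        score[winner] += 1
        score[loser] -= 1
    return score


def deconflict(n, edges):
    """ASSUMED heuristic: rank responses by net wins (ties broken by index)
    and keep only edges that agree with that ranking. All kept edges point
    forward in a total order, so the result is acyclic."""
    score = net_wins(n, edges)
    order = sorted(range(n), key=lambda v: (-score[v], v))
    rank = {v: r for r, v in enumerate(order)}
    return {(w, l) for w, l in edges if rank[w] < rank[l]}


def rewards(n, edges):
    """ASSUMED reward: net wins counted on the deconflicted graph."""
    return net_wins(n, deconflict(n, edges))


# Four responses; 0 > 1, 1 > 2, 2 > 0 form a preference cycle.
judgments = {(0, 1): True, (1, 2): True, (0, 2): False,
             (0, 3): True, (1, 3): True, (2, 3): True}
edges = build_edges(judgments)
print(has_cycle(4, edges))                  # True
print(has_cycle(4, deconflict(4, edges)))   # False
print(rewards(4, edges))                    # [2, 1, 0, -3]
```

In this toy case the three cycle members tie on net wins, so the index tie-break decides which edge is dropped; a real system would need a principled rule here. A per-batch conflict statistic could then be computed, for example, as the fraction of prompts whose raw graph fails `has_cycle`, though whether that matches the paper's CDR definition is not stated in the abstract.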
