Fugu-MT Paper Translation (Abstract): CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models

Paper summary: CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models

arxiv url: http://arxiv.org/abs/2603.04406v1
Date: Mon, 02 Feb 2026 12:21:59 GMT
Status: Translation complete
Last updated in system: 2026-03-09 01:20:08.181048
Title: CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models
Title (reference translation): CTRL-RAG: Reinforcement Learning Based on a Contrastive Likelihood Reward for Context-Faithful RAG Models
Authors: Zhehao Tan, Yihan Jiao, Dan Yang, Junjie Wang, Duolin Sun, Jie Feng, Xidong Wang, Lei Liu, Yue Shen, Jian Wang, Jinjie Gu
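Abstract summary: Existing RAG-oriented reinforcement learning methods rely on external rewards that often fail to evaluate document faithfulness. The paper proposes a novel "internal-external" hybrid reward framework centered on a Contrastive Likelihood Reward (CLR). CLR directly optimizes the log-likelihood gap between responses conditioned on prompts with and without supporting evidence.
Reference score (independently computed attention score): 29.703793991791674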
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the growing use of Retrieval-Augmented Generation (RAG), training large language models (LLMs) for context-sensitive reasoning and faithfulness is increasingly important. Existing RAG-oriented reinforcement learning (RL) methods rely on external rewards that often fail to evaluate document faithfulness, and may misjudge similar answers in open-domain settings. In addition, there is no RAG-based self-reward mechanism. Moreover, although such a mechanism could in principle estimate answer confidence given documents, the absence of objective feedback in self-judgment can cause hallucination accumulation and eventual model collapse. To tackle these issues, we propose a novel "internal-external" hybrid reward framework centered on a Contrastive Likelihood Reward (CLR). CLR directly optimizes the log-likelihood gap between responses conditioned on prompts with and without supporting evidence. This encourages the model to extract relevant evidence and increases its confidence when grounded in a specific context. Experiments show that our method (used alone or combined with external correctness rewards) achieves strong performance on single-hop, multi-hop, vertical-domain, and faithfulness benchmarks. Our training code and models are coming soon.
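Abstract (reference translation): As Retrieval-Augmented Generation (RAG) sees wider use, training large language models (LLMs) for context-sensitive reasoning and faithfulness is becoming increasingly important. Existing RAG-oriented reinforcement learning (RL) methods rely on external rewards that often fail to evaluate document faithfulness and can misjudge similar answers in open-domain settings. In addition, no RAG-based self-reward mechanism exists. Furthermore, although such a mechanism could in principle estimate the confidence of an answer given the documents, the lack of objective feedback in self-judgment can lead to accumulated hallucinations and eventual model collapse. To address these issues, we propose a novel "internal-external" hybrid reward framework centered on a Contrastive Likelihood Reward (CLR). CLR directly optimizes the log-likelihood gap between responses conditioned on prompts with and without supporting evidence. This encourages the model to extract relevant evidence and raises its confidence when it is grounded in a specific context. Experiments show that the method, used alone or combined with external correctness rewards, achieves strong performance on single-hop, multi-hop, vertical-domain, and faithfulness benchmarks. The training code and models are to be released soon.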
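
The abstract describes CLR only as a log-likelihood gap between a response scored with and without supporting evidence in the prompt. A minimal Python sketch of that idea follows. It is an illustration under stated assumptions, not the authors' implementation: the per-token log-probabilities are assumed to come from scoring the same response twice with a language model, and the function names, the length normalization, and the additive `weight` used to combine the gap with an external correctness reward are all hypothetical choices that the abstract does not specify.

```python
from typing import Sequence


def contrastive_likelihood_gap(
    logprobs_with_evidence: Sequence[float],
    logprobs_without_evidence: Sequence[float],
    length_normalize: bool = True,
) -> float:
    """Log-likelihood gap of one response under two prompts.

    Both inputs are per-token log-probabilities of the SAME response tokens:
    once conditioned on the prompt that includes the supporting documents,
    once conditioned on the prompt without them.
    Length normalization is an assumption of this sketch.
    """
    if len(logprobs_with_evidence) != len(logprobs_without_evidence):
        raise ValueError("both scorings must cover the same response tokens")
    if len(logprobs_with_evidence) == 0:
        raise ValueError("response must contain at least one token")

    # log p(y | question, documents) - log p(y | question)
    gap = sum(logprobs_with_evidence) - sum(logprobs_without_evidence)
    if length_normalize:
        gap /= len(logprobs_with_evidence)
    return gap


def hybrid_reward(
    clr_gap: float,
    external_correctness: float,
    weight: float = 0.5,
) -> float:
    """Hypothetical combination of the internal gap with an external reward.

    The abstract only says the two can be combined; the additive form and
    the default weight here are assumptions for illustration.
    """
    return external_correctness + weight * clr_gap


# Illustrative numbers only: the response is far more likely with evidence.
with_docs = [-0.1, -0.2, -0.3]
without_docs = [-1.0, -1.2, -0.8]
gap = contrastive_likelihood_gap(with_docs, without_docs)
reward = hybrid_reward(gap, external_correctness=1.0)
```

A positive gap means the response is better supported when the documents are present, which is the behavior the abstract says CLR encourages. A gap near zero or below suggests the response does not depend on the retrieved evidence.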

Related papers