Fugu-MT Paper Translation (Abstract): Causally-Enhanced Reinforcement Policy Optimization

Paper overview: Causally-Enhanced Reinforcement Policy Optimization

arxiv url: http://arxiv.org/abs/2509.23095v1
Date: Sat, 27 Sep 2025 04:10:16 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.033215
Title: Causally-Enhanced Reinforcement Policy Optimization
Title (reference Japanese translation): 因果強化強化政策最適化
Authors: Xiangqi Wang, Yue Huang, Yujun Zhou, Xiaonan Luo, Kehan Guo, Xiangliang Zhang
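Abstract summary: Causally-Enhanced Policy Optimization (CE-PO) is a drop-in reward-shaping framework that augments policy optimization with a differentiable proxy for causal coherence. CE-PO estimates model-internal influence with Jacobian-based sensitivities, counterfactually hardens these signals to suppress nuisance cues, and fuses the resulting coherence score with task-accuracy feedback. Experimental results across 4 datasets show that CE-PO improves accuracy over baselines by 5.49% on average (up to 9.58%), while improving robustness to correlation-causation flips and light counterfactual edits.
Reference score (independently computed attention score): 36.523007244998695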
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large language models (LLMs) trained with reinforcement objectives often achieve superficially correct answers via shortcut strategies, pairing correct outputs with spurious or unfaithful reasoning and degrading under small causal perturbations. We introduce Causally-Enhanced Policy Optimization (CE-PO), a drop-in reward-shaping framework that augments policy optimization with a differentiable proxy for causal coherence along the generation pathway from prompt (Z) to rationale (X) to answer (Y). CE-PO estimates model-internal influence with Jacobian-based sensitivities, counterfactually hardens these signals to suppress nuisance cues, and fuses the resulting coherence score with task-accuracy feedback via a Minkowski (power-mean) combiner, exposing a single tunable parameter for the accuracy-coherence trade-off. The unified reward integrates with PPO/GRPO without architectural changes. Across reasoning benchmarks and causal stress tests, CE-PO reduces reward hacking and unfaithful chain-of-thought while improving robustness to correlation-causation flips and light counterfactual edits, all at near-parity accuracy. Experimental results across 4 datasets show that CE-PO improves accuracy over baselines by 5.49% on average (up to 9.58%), while improving robustness to correlation-causation flips and light counterfactual edits.
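Abstract (reference translation): Large language models (LLMs) trained with reinforcement objectives often reach superficially correct answers through shortcut strategies, pairing correct outputs with spurious or unfaithful reasoning and degrading under small causal perturbations. We introduce Causally-Enhanced Policy Optimization (CE-PO), a drop-in reward-shaping framework that augments policy optimization with a differentiable proxy for causal coherence along the generation pathway from prompt (Z) to rationale (X) to answer (Y). CE-PO estimates model-internal influence with Jacobian-based sensitivities, counterfactually hardens these signals to suppress nuisance cues, and fuses the resulting coherence score with task-accuracy feedback via a Minkowski (power-mean) combiner, exposing a single tunable point on the accuracy-coherence trade-off. The unified reward integrates with PPO/GRPO without architectural changes. Across reasoning benchmarks and causal stress tests, CE-PO reduces reward hacking and unfaithful chain-of-thought while improving robustness to correlation-causation flips and light counterfactual edits. Experimental results across 4 datasets show that CE-PO improves accuracy by 5.49% on average (up to 9.58%), while improving robustness to correlation-causation flips and light counterfactual edits.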
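
The abstract names a Minkowski (power-mean) combiner but does not give its formula. The following is a minimal sketch, assuming both the task-accuracy score and the coherence score are normalized to [0, 1] and that the combiner is a weighted power mean whose exponent p acts as the single tunable parameter. The function name power_mean_reward and the parameters weight and eps are hypothetical and are not taken from the paper.

```python
import math


def power_mean_reward(accuracy, coherence, p=1.0, weight=0.5, eps=1e-8):
    """Weighted power mean of two scores in [0, 1] (illustrative only).

    Assumed form, not confirmed by the source:
        M_p = (weight * accuracy**p + (1 - weight) * coherence**p) ** (1 / p)
    with the geometric mean as the p -> 0 limit.
    """
    if not 0.0 < weight < 1.0:
        raise ValueError("weight must lie strictly between 0 and 1")

    # Clamp into [eps, 1] so the logarithms below are always defined.
    log_a = math.log(min(max(accuracy, eps), 1.0))
    log_c = math.log(min(max(coherence, eps), 1.0))

    # p == 0 is the limiting case: the weighted geometric mean.
    if abs(p) < 1e-12:
        return math.exp(weight * log_a + (1.0 - weight) * log_c)

    # Evaluate in log space (log-sum-exp) so large |p| cannot overflow.
    t_a = math.log(weight) + p * log_a
    t_c = math.log(1.0 - weight) + p * log_c
    m = max(t_a, t_c)
    log_sum = m + math.log(math.exp(t_a - m) + math.exp(t_c - m))
    return math.exp(log_sum / p)


if __name__ == "__main__":
    # A correct answer (accuracy 1.0) paired with a weak rationale (coherence 0.5).
    for p in (-50.0, -5.0, 0.0, 1.0, 5.0, 50.0):
        print(p, round(power_mean_reward(1.0, 0.5, p=p), 4))
```

Under these assumptions, a strongly negative p pushes the combined reward toward the smaller of the two scores, so a correct answer with an incoherent rationale earns little, while a strongly positive p pushes it toward the larger score. With p = 1 the combiner reduces to a plain weighted average.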

List of related papers