Fugu-MT Paper Translation (Abstract): Abductive Preference Learning

Paper summary: Abductive Preference Learning

arxiv url: http://arxiv.org/abs/2510.09887v1
Date: Fri, 10 Oct 2025 21:55:26 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:29.671316
Title: Abductive Preference Learning
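Title (reference translation): Abductive Preference Learning
Authors: Yijin Ni, Peng Qi
Abstract summary: The proposed method is a fine-tuning paradigm that reverses the conventional conditioning by learning preferences over prompts given a response. Standard methods improve response selection, abductive methods improve prompt discrimination, and a multitask objective unifies both.
Reference score (independently computed attention metric): 2.83533907065442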
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Frontier large language models such as GPT-5 and Claude Sonnet remain prone to overconfidence even after alignment through Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO). For instance, they tend to offer the same conservative answer "No" to both questions "Can I eat the [food / potato chips] that has been left out overnight?" despite the latter requiring no refrigeration for safe consumption. We find that this failure is potentially attributable to a limitation of existing preference learning: it emphasizes selecting the correct response for a given prompt, while neglecting counterfactual prompts that should alter the response. To address this limitation, we propose abductive preference learning, a fine-tuning paradigm that reverses the conventional conditioning by learning preferences over prompts given a response. To validate this idea, we construct an abductive dataset derived from the HaluEval QA benchmark with 1,001 entries, implementing abductive DPO and its variant DPOP. Experiments reveal complementary strengths: standard methods improve response selection, abductive methods improve prompt discrimination, while a multitask objective unifies both. On the abductive dataset, multitask DPOP boosts accuracy from 90.0% to 99.5% in response selection and from 54.7% to 85.0% in prompt discrimination, with qualitative evidence highlighting improved sensitivity to prompt differences. Finally, evaluation on AlpacaEval shows multitask DPOP improves win rate (from 5.26% to 6.17%), confirming that abductive preference learning preserves the benefits of conventional preference optimization while addressing the overlooked challenge of counterfactual prompts.
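Abstract (reference translation): Frontier large language models such as GPT-5 and Claude Sonnet remain prone to overconfidence even after alignment through Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO). For example, they tend to give the same conservative answer "No" to both questions. This failure is thought to stem from a limitation of existing preference learning, which emphasizes selecting the correct response for a given prompt while neglecting counterfactual prompts that should change the response. To address this limitation, the authors propose abductive preference learning, a fine-tuning paradigm that reverses the conventional conditioning by learning preferences over prompts given a response. To validate the idea, they construct an abductive dataset of 1,001 entries derived from the HaluEval QA benchmark and implement abductive DPO and its variant DPOP. Standard methods improve response selection, abductive methods improve prompt discrimination, and a multitask objective unifies both. On the abductive dataset, multitask DPOP raises accuracy from 90.0% to 99.5% in response selection and from 54.7% to 85.0% in prompt discrimination. Finally, evaluation on AlpacaEval shows that multitask DPOP improves win rate (from 5.26% to 6.17%), confirming that abductive preference learning preserves the benefits of conventional preference optimization while addressing the challenge of counterfactual prompts.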
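
The abstract describes the objectives only in words, so the following is a minimal sketch of how the standard, abductive, and multitask losses might relate, not the authors' implementation. It assumes the well-known DPO form, -log sigmoid(beta * margin), and assumes that the abductive variant scores one fixed response under a preferred and a dispreferred prompt. The function names, the `weight` mixing parameter, and all log-probability values are hypothetical. The DPOP variant is not shown.

```python
import math


def dpo_loss(logp_preferred, logp_dispreferred,
             ref_logp_preferred, ref_logp_dispreferred, beta=0.1):
    """DPO-style loss for one preference pair: -log(sigmoid(beta * margin)).

    Each argument is a sequence log-probability under the policy or the
    frozen reference model. What "preferred" refers to depends on the pair:
    a response (standard) or a prompt (abductive, assumed form).
    """
    margin = ((logp_preferred - ref_logp_preferred)
              - (logp_dispreferred - ref_logp_dispreferred))
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def multitask_loss(standard, abductive, weight=0.5):
    """Hypothetical convex combination of the two objectives."""
    return (1.0 - weight) * standard + weight * abductive


# Standard pair: one prompt x, preferred response y_w vs. dispreferred y_l.
# The values stand for log pi(y_w | x) and log pi(y_l | x).
standard = dpo_loss(logp_preferred=-4.0, logp_dispreferred=-6.0,
                    ref_logp_preferred=-5.0, ref_logp_dispreferred=-5.0)

# Abductive pair (assumed form): one response y, scored under the prompt x_w
# that should elicit it vs. a counterfactual prompt x_l that should not.
# The values stand for log pi(y | x_w) and log pi(y | x_l).
abductive = dpo_loss(logp_preferred=-3.0, logp_dispreferred=-3.5,
                     ref_logp_preferred=-3.2, ref_logp_dispreferred=-3.2)

print(multitask_loss(standard, abductive))
```

Under these assumptions the same loss function serves both tasks, and only the construction of the pair changes. That is one way to read "reverses the conventional conditioning", although the paper may define the abductive likelihood differently.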


Related papers