Fugu-MT Paper Translation (Abstract): Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards

Paper summary: Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards

arxiv url: http://arxiv.org/abs/2510.04214v1
Date: Sun, 05 Oct 2025 14:08:01 GMT
Status: translation complete
System update date: 2025-10-07 16:52:59.522866
Title: Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards
Title (reference translation): Teaching LLMs to Be Persuasive: "Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards"
Authors: Zhuoran Zhuang, Ye Chen, Xia Zeng, Chao Luo, Luhui Liu, Yihan Chen,
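Abstract summary: We deploy large language models (LLMs) as business development (BD) agents that conduct persuasive price negotiation in online travel agencies (OTAs). Reward-Enhanced Policy Optimization (REPO) is a reinforcement learning post-training framework that aligns an LLM with heterogeneous rewards. A straightforward enhancement mechanism combines the reward model (RM) with reward judge (RJ) and reward function (RF) signals to curb reward hacking and improve negotiation quality.
Reference score (independently computed attention score): 16.217316324851343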
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study deploying large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs), where aligning traveler affordability and hotel profitability directly affects bookings, partner relationships, and access to travel. The agent must follow a Standard Operating Procedure (SOP) while conducting multi-turn persuasion, interpreting colloquial inputs, and adhering to guardrails (no over-promising, no hallucinations). Conventional post-training -- supervised fine-tuning (SFT) or single-source reward optimization -- overfits scripts, misses nuanced persuasive style, and fails to enforce verifiable business constraints. We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training framework that aligns an LLM with heterogeneous rewards: a preference-trained reward model (RM) for dense human alignment, a reward judge (RJ) for high-level persuasive behavior and SOP compliance, and programmatic reward functions (RF) for deterministic checks on numerics, formatting, and guardrails. A straightforward enhancement mechanism is proposed to combine the RM with RJ and RF signals to curb reward hacking and improve negotiation quality. In production-style evaluations -- approximately 150 turns from real dialogues and 225 turns from curated bad-case dialogues -- REPO lifts average dialogue rating to 4.63 (+1.20 over base, +0.83 over Direct Preference Optimization (DPO), and +0.33 over Group Relative Policy Optimization (GRPO)), increases the share of conversations with at least one excellent response to 66.67% (+23.34 percentage points over GRPO), and achieves a 93.33% bad-case fix rate with 75.56% clean fixes, outperforming SFT, DPO, PPO, and GRPO. We also observe emergent capabilities -- proactive empathy, localized reasoning, calibrated tactics -- that surpass gold annotations.
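Abstract (reference translation): This study deploys large language models (LLMs) as business development (BD) agents that conduct persuasive price negotiation in online travel agencies (OTAs). The agent must follow a Standard Operating Procedure (SOP) while conducting multi-turn persuasion, interpreting colloquial inputs, and adhering to guardrails (no over-promising, no hallucinations). Conventional post-training, whether supervised fine-tuning (SFT) or single-source reward optimization, overfits to scripts, misses nuanced persuasive style, and fails to enforce verifiable business constraints. We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training framework that aligns an LLM with heterogeneous rewards: a preference-trained reward model (RM) for dense human alignment, a reward judge (RJ) for high-level persuasive behavior and SOP compliance, and programmatic reward functions (RF) for deterministic checks on numerics, formatting, and guardrails. A straightforward enhancement mechanism is proposed that combines the RM with RJ and RF signals to curb reward hacking and improve negotiation quality. REPO lifts the average dialogue rating to 4.63 (+1.20 over base, +0.83 over Direct Preference Optimization (DPO), and +0.33 over Group Relative Policy Optimization (GRPO)), increases the share of conversations with at least one excellent response to 66.67% (+23.34 percentage points over GRPO), and achieves a 93.33% bad-case fix rate with 75.56% clean fixes, outperforming SFT, DPO, PPO, and GRPO. We also observe emergent capabilities (proactive empathy, localized reasoning, calibrated tactics) that surpass gold annotations.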
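
The abstract reports REPO's scores together with its margins over other methods, so the comparison scores it implies can be recovered by subtraction. The short Python sketch below does that arithmetic. It assumes each margin is a plain difference on the same rating scale (or in percentage points for the share), and the variable names are illustrative only; the abstract does not state the baseline values directly.

```python
# Figures quoted in the abstract.
repo_rating = 4.63                      # REPO average dialogue rating
rating_gains = {"base": 1.20, "DPO": 0.83, "GRPO": 0.33}

# Implied average rating of each comparison method (assumption: simple difference).
implied_ratings = {
    name: round(repo_rating - gain, 2) for name, gain in rating_gains.items()
}

# Share of conversations with at least one excellent response, in percent.
repo_share = 66.67
gain_over_grpo_pp = 23.34               # percentage points
implied_grpo_share = round(repo_share - gain_over_grpo_pp, 2)

for name, rating in implied_ratings.items():
    print(f"{name}: implied average rating {rating:.2f}")
print(f"GRPO: implied excellent-response share {implied_grpo_share:.2f}%")
```

Under these assumptions the implied average ratings are 3.43 for the base model, 3.80 for DPO, and 4.30 for GRPO, and the implied GRPO share of conversations with at least one excellent response is 43.33%.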

