Fugu-MT Paper Translation (Abstract): Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs

Paper summary: Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs

arxiv url: http://arxiv.org/abs/2509.23166v1
Date: Sat, 27 Sep 2025 07:46:15 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.079618
Title: Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs
Authors: Chenxing Wei, Hong Wang, Ying He, Fei Yu, Yao Shu
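Abstract summary: The paper proposes Test-Time Policy Adaptation for Multi-Turn Interactions (T2PAM), a new paradigm that uses user feedback as a reward signal to estimate a latent optimal policy aligned with user preferences. It then introduces Optimum-Referenced One-Step Adaptation (ROSA), a lightweight algorithm that operationalizes T2PAM.
Reference score (attention measure computed independently by this site): 20.892283201423048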
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) employ multi-turn interaction as a fundamental paradigm for completing complex tasks. However, their performance often degrades in extended interactions, as they are typically trained on static, single-turn data, which hinders their ability to adapt to real-time user feedback. To address this limitation, we first propose a new paradigm: Test-Time Policy Adaptation for Multi-Turn Interactions (T2PAM), which utilizes user feedback from the ongoing interaction as a reward signal to estimate a latent optimal policy aligned with user preferences, then updates a small subset of parameters to steer the model toward this policy, ultimately enabling efficient in-conversation self-correction. We then introduce Optimum-Referenced One-Step Adaptation (ROSA), a lightweight algorithm that operationalizes T2PAM. ROSA guides the model parameters toward a theoretical optimal policy in a single, efficient update step, avoiding costly iterative gradient-based optimization and minimizing computational overhead. We provide a rigorous theoretical analysis guaranteeing that the policy of ROSA converges to the user's preferences as the number of interactions increases. Extensive experiments on challenging benchmarks demonstrate that ROSA achieves significant improvements in both task effectiveness and efficiency.
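The abstract does not give ROSA's update rule, so the following is only a toy sketch of the general idea it describes: treat feedback as a reward, form a reward-tilted target policy, and move toward it in one step. It assumes a categorical policy over a few candidate responses and uses the standard KL-regularized form pi*(y) proportional to pi(y) * exp(r(y) / beta). The function names, `beta`, `lr`, and the reward values are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_tilted_target(probs, rewards, beta=1.0):
    """Assumed target policy: pi*(y) proportional to pi(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def one_step_adapt(logits, rewards, beta=1.0, lr=1.0):
    """Take a single gradient step on the cross-entropy H(pi*, pi_theta).

    For a softmax policy, the gradient of that cross-entropy with respect
    to the logits is (pi_theta - pi*), so one step needs no inner loop.
    """
    probs = softmax(logits)
    target = reward_tilted_target(probs, rewards, beta)
    return [z - lr * (p - t) for z, p, t in zip(logits, probs, target)]


# Toy usage: three candidate responses, uniform initial policy.
# Hypothetical feedback: response 0 disliked, response 2 liked.
old_logits = [0.0, 0.0, 0.0]
feedback = [-1.0, 0.0, 1.0]
new_logits = one_step_adapt(old_logits, feedback)
old_probs = softmax(old_logits)
new_probs = softmax(new_logits)
```

After the single step, probability mass shifts from the disliked response toward the liked one. In the setting the abstract describes, the update would apply to a small subset of an LLM's parameters rather than to raw logits.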

Related papers