Fugu-MT Paper Translation (Abstract): Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs

Paper overview: Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs

arxiv url: http://arxiv.org/abs/2510.25441v1
Date: Wed, 29 Oct 2025 12:08:07 GMT
Status: Translation complete
Last updated in system: 2025-10-30 15:50:45.497502
Title: Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs
Title (reference translation): Learning and Deploying Proactive LLMs from Offline Logs
Authors: Fei Wei, Daoyuan Chen, Ce Wang, Yilun Huang, Yushuo Chen, Xuchen Pan, Yaliang Li, Bolin Ding
Abstract summary: Learn-to-Ask is a simulator-free framework for learning and deploying proactive dialogue agents. Our approach culminates in the successful deployment of LLMs into a large-scale online AI service.
Reference score (independently computed attention score): 72.08224879435762
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners, a critical capability in high-stakes domains, remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent "reality gap". To bridge this gap, we introduce Learn-to-Ask, a general, simulator-free framework for learning and deploying proactive dialogue agents directly from offline expert data, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the observed future of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert's revealed strategy, decomposing the intractable long-horizon problem into a series of supervised learning tasks, and training a policy to output a structured (action, state_assessment) tuple, governing both what to ask and, crucially, when to stop. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically purges noise from the LLM-based reward model with minimal human supervision. Empirically, we demonstrate the efficacy of Learn-to-Ask in a real-world medical dataset, using LLMs of varying sizes up to 32B. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service. In rigorous in-house evaluations, our model was launched and achieved performance even superior to human experts, proving our framework's ability to translate offline data into tangible, real-world impact. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented LLM applications.
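Abstract (reference translation): Large language models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners remains a major challenge. Current paradigms either myopically optimize single-turn attributes or depend on unstable, high-cost user simulators, creating a persistent "reality gap". To close this gap, we introduce Learn-to-Ask, a general simulator-free framework for learning and deploying proactive dialogue agents directly from offline expert data, removing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by exploiting the observed future of each expert trajectory. This lets us infer a dense, turn-by-turn reward signal rooted in the strategy the expert revealed, decompose the intractable long-horizon problem into a series of supervised learning tasks, and train a policy that outputs a structured (action, state_assessment) tuple controlling both what to ask and, crucially, when to stop. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically removes noise from the LLM-based reward model with minimal human supervision. We demonstrate the effectiveness of Learn-to-Ask on a real-world medical dataset using LLMs of up to 32B. Our approach culminates in the successful deployment of LLMs into a large-scale online AI service. In rigorous in-house evaluations, our model was launched and achieved performance exceeding that of human experts. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented LLM applications.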
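The abstract describes decomposing a long-horizon dialogue problem into per-turn supervised tasks by using the observed future of each expert trajectory. The sketch below is only an illustration of that general idea, not the authors' implementation: the `Example` record, the `decompose` helper, and the `CONTINUE`/`STOP` labels are all hypothetical names, and the reward inference and grader calibration steps mentioned in the abstract are not modeled here.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical state_assessment labels; the paper's actual label set is not given here.
CONTINUE, STOP = "CONTINUE", "STOP"


@dataclass
class Example:
    context: List[str]           # dialogue history before the decision point
    future_questions: List[str]  # what the expert went on to ask (observed future)
    target: Tuple[str, str]      # (action, state_assessment)


def decompose(expert_questions: List[str], user_replies: List[str]) -> List[Example]:
    """Turn one logged expert dialogue into per-turn supervised examples."""
    if len(expert_questions) != len(user_replies):
        raise ValueError("each expert question needs exactly one user reply")
    examples: List[Example] = []
    history: List[str] = []
    for t, question in enumerate(expert_questions):
        # The observed future at turn t is everything the expert asked from t onward.
        future = expert_questions[t:]
        examples.append(Example(list(history), future, (question, CONTINUE)))
        history += [f"expert: {question}", f"user: {user_replies[t]}"]
    # After the last logged exchange the expert asked nothing more: a STOP example.
    examples.append(Example(list(history), [], ("", STOP)))
    return examples
```

Under these assumptions, a logged dialogue with two expert questions yields three examples: two that continue asking and a final one whose target carries the stop decision.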

Related papers