Fugu-MT Paper Translation (Abstract): Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants

Paper overview: Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants

arxiv url: http://arxiv.org/abs/2605.09808v1
Date: Sun, 10 May 2026 23:06:24 GMT
Status: Translation complete
System update date: 2026-05-12 23:28:50.428371
Title: Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants
Authors: Joseph Suh, Ayush Raj, Minwoo Kang, Serina Chang
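Abstract summary: We show how simulator quality can be quantified in terms of downstream utility. We train LLM assistants via reinforcement learning against a spectrum of simulators. As evaluation, we measure pairwise win rates in a user study with 283 participants.
Reference score (independently computed attention score): 7.523995265564992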
License: http://creativecommons.org/licenses/by/4.0/
Abstract: User simulators are increasingly leveraged to build interactive AI assistants, yet how to measure the quality of these simulators remains an open question. In this work, we show how simulator quality can be quantified in terms of its downstream utility: how an LLM assistant trained with this user simulator performs in the wild when interacting with real humans. In a controlled experiment where only the user simulator varies, we train LLM assistants via reinforcement learning against a spectrum of simulators, from an LLM prompted to role-play a user to one fine-tuned on human utterances from WildChat. As evaluation, we measure pairwise win rates in a user study with 283 participants and on WildBench, a benchmark derived from real human-AI conversations. Training against the role-playing LLM yields an assistant statistically indistinguishable from the initial assistant in our user study (51% win rate), whereas training against the fine-tuned simulator yields significant gains (58% over the initial and 57% over the one trained against role-playing). Closer inspection reveals three further patterns: methods for making role-playing LLMs more realistic (e.g., persona conditioning) improve trained assistants but do not close the gap to the fine-tuned simulator; scaling the simulator's model size benefits the fine-tuned simulator but yields no gain for role-playing ones; and assistants trained against role-playing simulators fail to generalize when paired with other simulators at test time, while the one trained against the fine-tuned simulator does. Together, these results argue for grounding user simulators in real human behavior and measuring their quality by their downstream effect on real users.
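As a rough illustration of how a pairwise win rate such as the 51% or 58% quoted in the abstract can be checked against a 50% chance baseline, the sketch below runs an exact two-sided binomial test in plain Python. The abstract does not say which statistical test the authors used or how many comparisons each win rate rests on, so the function names and the counts (200 decisive comparisons per condition) are hypothetical and chosen only to reproduce the quoted percentages.

```python
from math import comb


def win_rate(wins: int, losses: int) -> float:
    """Share of decisive pairwise comparisons won (ties are assumed excluded)."""
    return wins / (wins + losses)


def binomial_two_sided_p(wins: int, n: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial p-value for observing `wins` out of `n`.

    Sums the probability of every outcome that is no more likely under the
    null win probability p0 than the observed outcome.
    """
    pmf = [comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]
    observed = pmf[wins]
    # Small relative tolerance so symmetric outcomes are not lost to rounding.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


# HYPOTHETICAL counts, not taken from the paper: 200 decisive comparisons each.
n_comparisons = 200
roleplay_wins = 102   # corresponds to a 51% win rate over the initial assistant
finetuned_wins = 116  # corresponds to a 58% win rate over the initial assistant

p_roleplay = binomial_two_sided_p(roleplay_wins, n_comparisons)
p_finetuned = binomial_two_sided_p(finetuned_wins, n_comparisons)

print(f"role-play win rate:  {win_rate(roleplay_wins, n_comparisons - roleplay_wins):.2f}, p = {p_roleplay:.3f}")
print(f"fine-tuned win rate: {win_rate(finetuned_wins, n_comparisons - finetuned_wins):.2f}, p = {p_finetuned:.3f}")
```

Under these assumed counts, the 51% result is compatible with chance while the 58% result is not at the 5% level. With a different number of comparisons the same percentages could lead to different conclusions, which is why the underlying counts matter when reading win rates.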


Related papers