Fugu-MT Paper Translation (Abstract): Mind the Sim2Real Gap in User Simulation for Agentic Tasks

Paper overview: Mind the Sim2Real Gap in User Simulation for Agentic Tasks

arxiv url: http://arxiv.org/abs/2603.11245v1
Date: Wed, 11 Mar 2026 19:12:31 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:25.605849
Title: Mind the Sim2Real Gap in User Simulation for Agentic Tasks
Title (reference translation): Mind the Sim2Real Gap in User Simulation for Agentic Tasks
Authors: Xuhui Zhou, Weiwei Sun, Qianou Ma, Yiqing Xie, Jiarui Liu, Weihua Du, Sean Welleck, Yiming Yang, Graham Neubig, Sherry Tongshuang Wu, Maarten Sap
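Abstract summary: We formalize the Sim2Real gap in user simulation and present the first study running the $τ$-bench protocol with real humans. LLM simulators are excessively cooperative, stylistically uniform, and lack realistic frustration or ambiguity. These findings highlight the importance of human validation when using LLM-based user simulators in the agent development cycle.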
Reference score (independently computed attention score): 101.69142591891234
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As NLP evaluation shifts from static benchmarks to multi-turn interactive settings, LLM-based simulators have become widely used as user proxies, serving two roles: generating user turns and providing evaluation signals. Yet, these simulations are frequently assumed to be faithful to real human behaviors, often without rigorous verification. We formalize the Sim2Real gap in user simulation and present the first study running the full $τ$-bench protocol with real humans (451 participants, 165 tasks), benchmarking 31 LLM simulators across proprietary, open-source, and specialized families using the User-Sim Index (USI), a metric we introduce to quantify how well LLM simulators resemble real user interactive behaviors and feedback. Behaviorally, LLM simulators are excessively cooperative, stylistically uniform, and lack realistic frustration or ambiguity, creating an "easy mode" that inflates agent success rates above the human baseline. In evaluations, real humans provide nuanced judgments across eight quality dimensions while simulated users produce uniformly more positive feedback; rule-based rewards are failing to capture rich feedback signals generated by human users. Overall, higher general model capability does not necessarily yield more faithful user simulation. These findings highlight the importance of human validation when using LLM-based user simulators in the agent development cycle and motivate improved models for user simulation.
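Abstract (reference translation): As NLP evaluation shifts from static benchmarks to multi-turn interactive settings, LLM-based simulators have become widely used as user proxies, serving two roles: generating user turns and providing evaluation signals. Yet these simulations are often assumed to be faithful to real human behavior without rigorous verification. We formalize the Sim2Real gap in user simulation and present the first study running the full $τ$-bench protocol with real humans (451 participants, 165 tasks). Behaviorally, LLM simulators are excessively cooperative, stylistically uniform, and lack realistic frustration or ambiguity, creating an "easy mode" that inflates agent success rates above the human baseline. In evaluations, real humans provide nuanced judgments across eight quality dimensions, while simulated users produce uniformly more positive feedback. Overall, higher general model capability does not necessarily yield more faithful user simulation. These findings highlight the importance of human validation when using LLM-based user simulators in the agent development cycle and motivate improved models for user simulation.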

