Fugu-MT Paper Translation (Abstract): SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?

Paper overview: SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?

arxiv url: http://arxiv.org/abs/2510.05444v1
Date: Mon, 06 Oct 2025 23:17:44 GMT
Status: Translation complete
System update date: 2025-10-08 17:57:08.021472
Title: SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?
Authors: Yao Dou, Michel Galley, Baolin Peng, Chris Kedzie, Weixin Cai, Alan Ritter, Chris Quirk, Wei Xu, Jianfeng Gao
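Abstract summary: Large language models (LLMs) are increasingly used in interactive applications. Human evaluation remains the gold standard for assessing their performance in multi-turn conversations. The authors introduce SimulatorArena, a benchmark of 909 annotated human-LLM conversations on two interactive tasks.
Reference score (independently computed attention score): 61.07963107032645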
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly used in interactive applications, and human evaluation remains the gold standard for assessing their performance in multi-turn conversations. Since human studies are costly, time-consuming, and hard to reproduce, recent work explores using LLMs to simulate users for automatic assistant evaluation. However, there is no benchmark or systematic study to evaluate whether these simulated users are reliable stand-ins for real users. To address this, we introduce SimulatorArena, a benchmark of 909 annotated human-LLM conversations on two interactive tasks -- math tutoring and document creation. SimulatorArena evaluates simulators based on how closely their messages match human behavior and how well their assistant ratings align with human judgments. Experiments on various simulator methods show that simulators conditioned on user profiles, capturing traits like background and message styles, align closely with human judgments. They reach Spearman's $\rho$ of 0.7 on both tasks, providing a practical, scalable alternative to human evaluation. Using the best simulator for each task, we benchmark 18 assistants, including the latest LLMs such as GPT-5, Claude 4.1 Opus, and Gemini 2.5 Pro.
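The abstract reports agreement between simulator and human ratings as Spearman's $\rho$. As a minimal sketch of how that statistic is computed, the Python below ranks two score lists (tied values share the mean of their positions) and takes the Pearson correlation of the ranks. The per-assistant scores are hypothetical illustration values, not data from the paper, and the function names are assumptions made for this example rather than anything from the SimulatorArena code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical mean ratings for five assistants (NOT taken from the paper).
human_ratings = [8.1, 6.4, 7.2, 5.0, 9.0]
simulator_ratings = [7.9, 6.9, 6.5, 5.2, 8.8]

print(round(spearman_rho(human_ratings, simulator_ratings), 3))
```

Because only the ordering of the scores matters, a simulator can score highly on this measure even if its absolute ratings are shifted relative to human ratings, which suits the goal of ranking assistants.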
