Fugu-MT Paper Translation (Summary): AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI

Paper summary: AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI

arxiv url: http://arxiv.org/abs/2510.18170v1
Date: Mon, 20 Oct 2025 23:48:07 GMT
Status: Translation complete
System update date: 2025-10-25 03:08:12.730976
Title: AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI
Title (reference translation): AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI
Authors: Manik Rana, Calissa Man, Anotida Expected Msiiwa, Jeffrey Paine, Kevin Zhu, Sunishchal Dev, Vasu Sharma, Ahan M R
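Abstract summary: AgentChangeBench is a benchmark designed to measure how tool-augmented language model agents adapt to mid-dialogue goal shifts. The framework formalizes evaluation through four complementary metrics: Task Success Rate (TSR) for effectiveness, Tool Use Efficiency (TUE) for reliability, Tool Call Redundancy Rate (TCRR) for wasted effort, and Goal-Shift Recovery Time (GSRT) for adaptation latency.
Reference score (attention score computed independently by this site): 5.165179548592513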
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Goal changes are a defining feature of real-world multi-turn interactions, yet current agent benchmarks primarily evaluate static objectives or one-shot tool use. We introduce AgentChangeBench, a benchmark explicitly designed to measure how tool-augmented language model agents adapt to mid-dialogue goal shifts across three enterprise domains. Our framework formalizes evaluation through four complementary metrics: Task Success Rate (TSR) for effectiveness, Tool Use Efficiency (TUE) for reliability, Tool Call Redundancy Rate (TCRR) for wasted effort, and Goal-Shift Recovery Time (GSRT) for adaptation latency. AgentChangeBench comprises 2,835 task sequences and five user personas, each designed to trigger realistic shift points in ongoing workflows. Using this setup, we evaluate several frontier models and uncover sharp contrasts obscured by traditional $\text{pass}@k$ scores: for example, GPT-4o reaches $92.2\%$ recovery on airline booking shifts while Gemini collapses to $48.6\%$, and retail tasks show near-perfect parameter validity yet redundancy rates above $80\%$, revealing major inefficiencies. These findings demonstrate that high raw accuracy does not imply robustness under dynamic goals, and that explicit measurement of recovery time and redundancy is essential. AgentChangeBench establishes a reproducible testbed for diagnosing and improving agent resilience in realistic enterprise settings.
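Abstract (reference translation): Goal changes are a defining feature of real-world multi-turn interactions, but current agent benchmarks mainly evaluate static objectives or one-shot tool use. AgentChangeBench is a benchmark designed explicitly to measure how tool-augmented language model agents adapt to mid-dialogue goal shifts across three enterprise domains. The framework formalizes evaluation with four complementary metrics: Task Success Rate (TSR) for effectiveness, Tool Use Efficiency (TUE) for reliability, Tool Call Redundancy Rate (TCRR) for wasted effort, and Goal-Shift Recovery Time (GSRT) for adaptation latency. AgentChangeBench consists of 2,835 task sequences and five user personas. With this setup, the authors evaluate several frontier models and reveal sharp contrasts that traditional pass@k scores obscure: for example, GPT-4o reaches 92.2% recovery on airline booking shifts while Gemini falls to 48.6%, and retail tasks show near-perfect parameter validity yet redundancy rates above 80%, indicating major inefficiencies. These results show that high raw accuracy does not imply robustness under dynamic goals, and that explicit measurement of recovery time and redundancy is essential. AgentChangeBench establishes a reproducible testbed for diagnosing and improving agent resilience in realistic enterprise settings.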
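
The abstract names the metrics but does not give their formulas, so the following is only a rough sketch of how two of them might be computed from a logged dialogue. It assumes that a tool call is "redundant" when it repeats an earlier call with the same name and arguments, and that recovery time is counted in turns from the goal shift to the first call to a tool serving the new goal. The `ToolCall` record, the helper names, and the tool names in the example trace are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical log record: one tool invocation at a given dialogue turn."""
    turn: int
    name: str
    args: tuple  # sorted (key, value) pairs, so identical calls compare equal


def make_call(turn: int, name: str, **kwargs) -> ToolCall:
    # Sorting the keyword arguments makes the argument order irrelevant.
    return ToolCall(turn, name, tuple(sorted(kwargs.items())))


def redundancy_rate(calls: list) -> float:
    """Assumed TCRR-style measure: share of calls repeating an earlier (name, args)."""
    if not calls:
        return 0.0
    seen = set()
    redundant = 0
    for call in calls:
        key = (call.name, call.args)
        if key in seen:
            redundant += 1
        else:
            seen.add(key)
    return redundant / len(calls)


def recovery_turns(shift_turn: int, calls: Iterable,
                   new_goal_tools: set) -> Optional[int]:
    """Assumed GSRT-style measure: turns from the shift to the first call
    to a tool that serves the new goal; None if the agent never recovers."""
    for call in sorted(calls, key=lambda c: c.turn):
        if call.turn >= shift_turn and call.name in new_goal_tools:
            return call.turn - shift_turn
    return None


# Hypothetical trace: the user switches from booking to a baggage question at turn 3.
trace = [
    make_call(1, "search_flights", dest="NRT"),
    make_call(2, "search_flights", dest="NRT"),   # repeats turn 1
    make_call(4, "get_baggage_policy", fare="basic"),
    make_call(5, "search_flights", dest="NRT"),   # repeats turn 1 again
]
print(redundancy_rate(trace))
print(recovery_turns(3, trace, {"get_baggage_policy"}))
```

Under these assumptions, an agent can score well on recovery while still wasting calls, which is the kind of contrast the abstract reports for the retail tasks.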

Related papers