Fugu-MT Paper Translation (Abstract): Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis

Paper summary: Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis

arxiv url: http://arxiv.org/abs/2603.15483v1
Date: Mon, 16 Mar 2026 16:14:28 GMT
Status: Translation complete
Last updated in system: 2026-03-17 18:28:58.579885
Title: Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis
Authors: Penny Chong, Harshavardhan Abichandani, Jiyuan Shen, Atin Ghosh, Min Pyae Moe, Yifan Mai, Daniel Dahlmeier
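Abstract summary: We argue that effective agent evaluation goes beyond correctness alone, incorporating conversation quality, efficiency, and systematic diagnosis of agent errors. We propose new metrics that capture both the agent's turn efficiency and its intermediate progress. The TED framework reveals new insights into agent performance across models and user expertise levels.
Reference score (independently computed attention score): 3.3237915628874632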
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Agent applications are increasingly adopted to automate workflows across diverse tasks. However, due to the heterogeneous domains they operate in, it is challenging to create a scalable evaluation framework. Prior works each employ their own methods to determine task success, such as database lookups, regex match, etc., adding complexity to the development of a unified agent evaluation approach. Moreover, they do not systematically account for the user's role or expertise in the interaction, providing incomplete insights into the agent's performance. We argue that effective agent evaluation goes beyond correctness alone, incorporating conversation quality, efficiency and systematic diagnosis of agent errors. To address this, we introduce the TED framework (Talk, Evaluate, Diagnose). (1) Talk: We leverage reusable, generic expert and non-expert user persona templates for user-agent interaction. (2) Evaluate: We adapt existing datasets by representing subgoals (such as tool signatures and responses) as natural language grading notes, evaluated automatically with LLM-as-a-judge. We propose new metrics that capture both turn efficiency and intermediate progress of the agent, complementing the user-aware setup. (3) Diagnose: We introduce an automated error analysis tool that analyzes the inconsistencies of the judge and agents, uncovering common errors, and providing actionable feedback for agent improvement. We show that our TED framework reveals new insights regarding agent performance across models and user expertise levels. We also demonstrate potential gains in agent performance with peaks of 8-10% on our proposed metrics after incorporating the identified error remedies into the agent's design.
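The abstract does not define the proposed metrics, so the following is only a minimal sketch of how subgoals expressed as grading notes could be turned into a score that reflects both intermediate progress and turn efficiency. Everything here is an assumption made for illustration: the `GradingNote` structure, the `progress_curve` and `turn_aware_score` functions, the fixed turn budget, and the hand-written judge verdicts (which an LLM-as-a-judge would supply in practice). It is not the paper's metric.

```python
from dataclasses import dataclass


@dataclass
class GradingNote:
    """One subgoal phrased in natural language (hypothetical structure)."""
    text: str


def progress_curve(notes, judged_turns):
    """Cumulative fraction of grading notes satisfied after each turn.

    judged_turns[t] is the set of note indices a judge marked as satisfied
    at turn t. In a real setup these verdicts would come from an LLM judge;
    here they are supplied by hand.
    """
    satisfied = set()
    curve = []
    for verdicts in judged_turns:
        satisfied |= verdicts
        curve.append(len(satisfied) / len(notes))
    return curve


def turn_aware_score(curve, max_turns):
    """Mean progress over a fixed turn budget (illustrative, not the paper's metric).

    Turns after the conversation ends keep the final progress value, so an
    agent that reaches the same subgoals in fewer turns scores higher.
    """
    if not curve:
        return 0.0
    padded = curve[:max_turns] + [curve[-1]] * (max_turns - len(curve))
    return sum(padded) / max_turns


notes = [
    GradingNote("Agent calls the booking lookup tool with the user's ID"),
    GradingNote("Agent confirms the travel date with the user"),
    GradingNote("Agent calls the rebooking tool with the new date"),
    GradingNote("Agent reports the new booking reference"),
]

# Hand-written verdicts for two hypothetical conversations.
slow_curve = progress_curve(notes, [{0}, {1, 2}, set(), {3}])
fast_curve = progress_curve(notes, [{0, 1}, {2, 3}])

slow_score = turn_aware_score(slow_curve, max_turns=4)
fast_score = turn_aware_score(fast_curve, max_turns=4)
```

Both conversations end with every note satisfied, but the faster one receives the higher score because it accumulates progress earlier within the same turn budget. Averaging over a fixed budget is just one simple way to combine the two signals; other weightings would be equally plausible.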
