Fugu-MT Paper Translation (Abstract): EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Paper overview: EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

arxiv url: http://arxiv.org/abs/2605.13841v1
Date: Wed, 13 May 2026 17:58:52 GMT
Status: Translation complete
System update date: 2026-05-14 23:30:28.227829
Title: EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
Title (reference translation): EVA-Bench: A New End-to-End Framework for Evaluating Voice Agents
Authors: Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Fanny Riols, Hoang H. Nguyen, Raghav Mehndiratta, Lindsay Devon Brin, Joseph Marinier, Hari Subramani, Anil Madamala, Sridhar Krishna Nemala, Srinivas Sunkara,
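Abstract summary: EVA-Bench is an end-to-end evaluation framework for voice agents. It orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues. It captures task completion, faithfulness, and audio-level speech fidelity, as well as conversation progression, spoken conciseness, and turn-taking timing.
Reference score (independently computed attention score): 3.0301675282070577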
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to different agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k - pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
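Abstract (reference translation): Voice agents, artificial intelligence systems that complete tasks through spoken conversation, are increasingly widespread across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. EVA-Bench is an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator errors and regenerates the affected conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), which captures task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), which captures conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to different agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, and pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k - pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean up to 0.314). The full framework, evaluation suite, and benchmark data are released under an open-source license.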
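
The abstract contrasts pass@1, pass@k, and pass^k as measures of peak versus reliable capability. The sketch below illustrates one common reading of these quantities, assuming each scenario is run k times and each trial yields a boolean success: pass@1 as the mean per-trial success rate, pass@k as "at least one of k trials succeeds", and pass^k as "all k trials succeed". The abstract does not state the estimators EVA-Bench actually uses, so the function names, the scenario labels, and the trial data here are all hypothetical.

```python
from typing import Dict, List


def pass_at_1(trials: List[bool]) -> float:
    """Fraction of individual trials that succeeded (assumed reading of pass@1)."""
    return sum(trials) / len(trials)


def pass_at_k(trials: List[bool]) -> float:
    """1.0 if any of the k trials succeeded: peak capability."""
    return 1.0 if any(trials) else 0.0


def pass_hat_k(trials: List[bool]) -> float:
    """1.0 only if all k trials succeeded: reliable capability."""
    return 1.0 if all(trials) else 0.0


def summarize(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Average each per-scenario measure over all scenarios for one system."""
    n = len(results)
    summary = {
        "pass@1": sum(pass_at_1(t) for t in results.values()) / n,
        "pass@k": sum(pass_at_k(t) for t in results.values()) / n,
        "pass^k": sum(pass_hat_k(t) for t in results.values()) / n,
    }
    # Gap between peak and reliable performance for this system.
    summary["gap"] = summary["pass@k"] - summary["pass^k"]
    return summary


# Hypothetical data: four scenarios, k = 3 trials each.
results = {
    "scenario_a": [True, True, True],     # always succeeds
    "scenario_b": [True, False, False],   # succeeds once
    "scenario_c": [False, False, False],  # never succeeds
    "scenario_d": [True, True, False],    # succeeds twice
}

if __name__ == "__main__":
    for name, value in summarize(results).items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, a system that succeeds only occasionally on a scenario still counts toward pass@k but not toward pass^k, which is why the two can diverge by a wide margin.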
