Fugu-MT Paper Translation (Abstract): IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows

Paper overview: IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows

arxiv url: http://arxiv.org/abs/2606.19595v1
Date: Wed, 17 Jun 2026 20:58:35 GMT
Status: Translation complete
Last updated in system: 2026-06-19 18:23:39.539885
Title: IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows
Title (reference translation): IHBench: Evaluating Post-Interruption Recovery of Voice Agents with Structured Workflows
Authors: Ahmad Salimi, Wentao Ma, Yuzhi Tang, Dongming Shen, Mu Li, Alex Smola
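Abstract summary: Existing benchmarks for speech-capable models focus on the timing of interruptions. We introduce IHBench, a benchmark that evaluates post-interruption recovery in voice agents. We evaluate 27 audio-language model configurations from OpenAI, Google, and the open-weight community.
Reference score (independently computed attention score): 17.25449864868632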
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for speech-capable models focus on the timing of interruptions: barge-in detection, endpointing, and turn-taking dynamics. They leave unmeasured what happens after the interruption: does the agent resume the workflow at the correct step? Does it address the user's interjection? Does it avoid re-delivering content the user already heard? We introduce IHBench (Interruption Handling Benchmark), a benchmark that evaluates post-interruption recovery in voice agents executing state-machine-driven workflows across 10 enterprise domains. Six interruption types are injected at controlled points mid-utterance, with per-interruption evaluation rubrics generated alongside the data. Each interruption is scored on two axes: task fulfillment and recovery quality. We evaluate 27 audio-language model configurations from OpenAI, Google, and the open-weight community. Models vary widely, and recovery quality depends strongly on the interruption type. Across our experiments, closed-weight models are consistently more robust to interruptions than open-weight ones: they win far more often on task fulfillment, degrade roughly 3.3x more slowly as conversations grow longer, and show no audio-versus-text modality gap, whereas the open-weight models lose ground on all three. A human study validates the LLM judge against human annotators, and a cross-benchmark analysis against AudioMultiChallenge indicates that recovery quality is a largely distinct capability axis.
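Abstract (reference translation): Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for speech-capable models focus on the timing of interruptions, such as barge-in detection, endpointing, and turn-taking dynamics. These benchmarks leave unmeasured what happens after the interruption: Does the agent resume the workflow at the correct step? Does it address the user's interjection? Does it avoid re-delivering content the user has already heard? IHBench (Interruption Handling Benchmark) is a benchmark that evaluates post-interruption recovery in voice agents executing state-machine-driven workflows across 10 enterprise domains. Six interruption types are injected at controlled points mid-utterance, and per-interruption evaluation rubrics are generated alongside the data. Each interruption is scored on two axes: task fulfillment and recovery quality. We evaluate 27 audio-language model configurations from OpenAI, Google, and the open-weight community. Models vary widely, and recovery quality depends strongly on the interruption type. Across the experiments, closed-weight models are consistently more robust to interruptions than open-weight models: they win far more often on task fulfillment, degrade roughly 3.3x more slowly as conversations grow longer, and show no audio-versus-text modality gap, whereas the open-weight models lose ground on all three. A human study validates the LLM judge against human annotators, and a cross-benchmark analysis against AudioMultiChallenge suggests that recovery quality is a largely distinct capability axis.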

Related papers