Fugu-MT Paper Translation (Abstract): Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Paper summary: Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

arxiv url: http://arxiv.org/abs/2605.29430v1
Date: Thu, 28 May 2026 06:23:31 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:55.85389
Title: Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation
Title (reference translation): Toward Human-Like Interactive Speech Recognition Through Agentic Correction and Semantic Evaluation
Authors: Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen
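Abstract summary: The paper proposes Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors.
Reference score (independently computed attention score): 53.844308305341166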
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automatic speech recognition (ASR) is a core component of human--computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate \emph{Interactive ASR} as a multi-turn refinement task and propose \textbf{Agentic ASR}, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the \textbf{Sentence-level Semantic Error Rate} ($S^2ER$), an LLM-based semantic evaluation metric, together with an \textbf{Interactive Simulation System} for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in $S^2ER$ than in conventional token-level metrics. Human--AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at: https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/
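Abstract (reference translation): Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. Most current ASR systems, however, follow a single-pass paradigm. This fits poorly with human communication, in which misunderstandings are resolved through repeated clarification and refinement, and it makes meaning-critical errors hard to correct once they have occurred. Token-level metrics such as WER and CER also fail to reflect this problem adequately. To address these limitations, the authors formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. They further introduce the Sentence-level Semantic Error Rate ($S^2ER$), an LLM-based semantic evaluation metric, and an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in $S^2ER$ than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at https://interactiveasr.github.io/ and the live demo at https://i-asr.sjtuxlance.com/.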
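The contrast the abstract draws between token-level metrics and a sentence-level semantic metric can be illustrated with a small sketch. The abstract does not give the formula for $S^2ER$, so the code below assumes it is the fraction of sentences that a judge marks as semantically wrong. The function names and `toy_judge` (a rule-based stand-in for the LLM judge) are hypothetical and are not taken from the paper's code.

```python
from typing import Callable, List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Token-level WER: word edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1] / max(len(ref), 1)


def sentence_semantic_error_rate(
    pairs: List[Tuple[str, str]],
    judge: Callable[[str, str], bool],
) -> float:
    """ASSUMED definition: share of sentences the judge rejects."""
    if not pairs:
        return 0.0
    errors = sum(0 if judge(ref, hyp) else 1 for ref, hyp in pairs)
    return errors / len(pairs)


def toy_judge(reference: str, hypothesis: str) -> bool:
    """Hypothetical stand-in for an LLM judge.

    Treats two sentences as equivalent when they match after
    lowercasing and dropping a few filler words.
    """
    fillers = {"um", "uh"}

    def normalize(text: str) -> List[str]:
        return [w for w in text.lower().split() if w not in fillers]

    return normalize(reference) == normalize(hypothesis)
```

Under these assumptions, dropping a filler word ("um turn on the light" recognized as "turn on the light") raises WER but is not counted as a semantic error, whereas recognizing "new york" as "newark" counts as a full sentence-level error. A real evaluation would replace `toy_judge` with a call to an LLM-based judge.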

Related papers