Fugu-MT Paper Translation (Abstract): UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction

Paper overview: UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction

arxiv url: http://arxiv.org/abs/2604.19221v2
Date: Thu, 30 Apr 2026 07:45:08 GMT
Status: Translation complete
Last updated in system: 2026-05-01 14:06:12.527017
Title: UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction
Authors: Yadong Li, Guoxin Wu, Haiping Hou, Biye Li,
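Abstract summary: Full-duplex speech interaction is driving artificial intelligence toward more human-like conversational systems. Front-end components such as voice activity detection (VAD) and turn-taking detection (TD) are essential to speech assistants. This paper proposes the first unified audio front-end LLM (UAF) tailored for full-duplex speech systems.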
Reference score (independently computed attention score): 7.775050285048427
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Full-duplex speech interaction, as the most natural and intuitive mode of human communication, is driving artificial intelligence toward more human-like conversational systems. Traditional cascaded speech processing pipelines suffer from critical limitations, including accumulated latency, information loss, and error propagation across modules. To address these issues, recent efforts focus on end-to-end audio large language models (LLMs) such as GPT-4o, which primarily unify speech understanding and generation tasks. However, most of these models are inherently half-duplex and rely on a suite of separate, task-specific front-end components, such as voice activity detection (VAD) and turn-taking detection (TD). In developing our speech assistant, we observed that optimizing the speech front-end is as crucial as advancing the back-end unified model for achieving seamless, responsive interactions. To bridge this gap, we propose the first unified audio front-end LLM (UAF) tailored for full-duplex speech systems. Our model reformulates diverse audio front-end tasks, including VAD, TD, speaker recognition (SR), automatic speech recognition (ASR), and question answering (QA), into a single autoregressive sequence prediction problem. It takes streaming fixed-duration audio chunks (e.g., 600 ms) as input, uses a reference audio prompt at the beginning to anchor the target speaker, and autoregressively generates discrete tokens encoding both semantic content and system-level state controls (e.g., interruption signals). Experiments demonstrate that our model achieves leading performance across multiple audio front-end tasks and significantly improves response latency and interruption accuracy in real-world interaction scenarios.
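The streaming input and mixed token output described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the 16 kHz sample rate, the zero-padding of the final chunk, the helper names `iter_chunks` and `split_tokens`, and the control-token strings are all assumptions made for illustration. Only the 600 ms chunk duration and the idea of tokens carrying both semantic content and state controls come from the abstract.

```python
from typing import Iterator, List, Sequence, Tuple

SAMPLE_RATE_HZ = 16_000  # assumption: the abstract does not state a sample rate
CHUNK_MS = 600           # example chunk duration given in the abstract


def iter_chunks(samples: Sequence[float],
                sample_rate_hz: int = SAMPLE_RATE_HZ,
                chunk_ms: int = CHUNK_MS) -> Iterator[List[float]]:
    """Yield consecutive fixed-duration chunks, zero-padding the last one."""
    chunk_len = sample_rate_hz * chunk_ms // 1000
    for start in range(0, len(samples), chunk_len):
        chunk = list(samples[start:start + chunk_len])
        # Pad a short final chunk so every chunk has the same length.
        chunk.extend([0.0] * (chunk_len - len(chunk)))
        yield chunk


# Hypothetical control-token names, invented for this sketch.
CONTROL_TOKENS = {"<silence>", "<speech>", "<turn_end>", "<interrupt>"}


def split_tokens(tokens: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Separate system-level control tokens from semantic (text) tokens."""
    controls = [t for t in tokens if t in CONTROL_TOKENS]
    text = [t for t in tokens if t not in CONTROL_TOKENS]
    return controls, text


if __name__ == "__main__":
    one_second = [0.1] * SAMPLE_RATE_HZ
    chunks = list(iter_chunks(one_second))
    print(len(chunks), len(chunks[0]))
    print(split_tokens(["<speech>", "hello", "world", "<turn_end>"]))
```

In a full-duplex loop, each chunk would be passed to the model as it arrives, and the control tokens in the output would drive system state (for example, stopping playback on an interruption) while the remaining tokens carry the transcript or answer.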

