Fugu-MT Paper Translation (Abstract): Audio Interaction Model

Paper summary: Audio Interaction Model

arxiv url: http://arxiv.org/abs/2606.05121v1
Date: Wed, 03 Jun 2026 17:26:11 GMT
Status: Translation complete
Last updated in system: 2026-06-04 20:44:18.926518
Title: Audio Interaction Model
Title (reference translation): 音声対話モデル (Audio Interaction Model)
Authors: Zhifei Xie, Zihang Liu, Ze An, Xiaobin Hu, Yue Liao, Ziyang Ma, Dongchao Yang, Mingbao Lin, Deheng Ye, Shuicheng Yan, Chunyan Miao
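Abstract summary: Today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as ASR or voice chat. The paper argues for unifying them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. The authors formalize this regime as the Audio Interaction Model and realize it with Audio-Interaction, a unified streaming model that retains offline task execution.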
Reference score (independently computed attention measure): 102.4354125819644
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.
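Abstract (reference translation): Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as ASR or voice chat. The goal is a single online LALM that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts in real time. This regime is formalized as the Audio Interaction Model and realized with Audio-Interaction, a unified streaming model that retains offline task execution. To enable this, the authors propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. They further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs.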
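
The always-on perceive-decide-respond loop described in the abstract can be illustrated with a minimal Python sketch. This is not the SoundFlow implementation: every name here (StreamState, perceive, decide, respond, run_loop) is hypothetical, text strings stand in for audio chunks, and the question-mark rule is only a toy placeholder for deciding when to respond from the semantics of the stream.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class StreamState:
    """Rolling context accumulated from the incoming stream (hypothetical)."""
    chunks: List[str] = field(default_factory=list)


def perceive(state: StreamState, chunk: str) -> None:
    # Perceive: fold the newest chunk into the running context.
    state.chunks.append(chunk)


def decide(state: StreamState) -> bool:
    # Decide: toy policy that responds only when the latest chunk
    # looks like a question. A real model would make this decision
    # from the semantics of the stream.
    return bool(state.chunks) and state.chunks[-1].endswith("?")


def respond(state: StreamState) -> str:
    # Respond: produce an output conditioned on the latest chunk.
    return f"reply to: {state.chunks[-1]}"


def run_loop(stream: Iterable[str]) -> List[Optional[str]]:
    """Run perceive -> decide -> respond once per incoming chunk.

    Returns one entry per chunk: a response string, or None when the
    loop chose to stay silent.
    """
    state = StreamState()
    outputs: List[Optional[str]] = []
    for chunk in stream:
        perceive(state, chunk)
        outputs.append(respond(state) if decide(state) else None)
    return outputs


if __name__ == "__main__":
    for out in run_loop(["door slams", "what was that?", "music plays"]):
        print(out)
```

The point of the sketch is the control flow: the loop runs on every chunk and silence is a valid outcome, which is what separates an online model from an offline one that answers only after the full input has arrived.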

Related papers