Fugu-MT Paper Translation (Summary): Continuous Audio Thinking for Large Audio Language Models

Paper summary: Continuous Audio Thinking for Large Audio Language Models

arxiv url: http://arxiv.org/abs/2606.18273v1
Date: Fri, 05 Jun 2026 11:38:30 GMT
Status: Translation complete
Updated in system: 2026-06-21 20:00:42.795299
Title: Continuous Audio Thinking for Large Audio Language Models
Authors: Gyojin Han, Dong-Jae Lee, Changho Choi, Jongsuk Kim, Junmo Kim
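Abstract summary: We introduce Continuous Audio Thinking (CoAT), which equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation. Within the thinking space, the model can draw on the rich acoustic information provided by expert distillation when generating its response. CoAT requires no additional autoregressive decoding cost over the baseline.
Reference score (independently computed attention score): 16.335310406868217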
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.
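Illustrative sketch (not from the paper): the abstract says the continuous thinking block is processed in a single prefill and is grounded by distillation from audio experts, but it does not give the slot design or the loss. The toy Python below assumes the thinking block is a set of K slot vectors appended to the input sequence, and assumes a mean-squared-error loss against expert targets at those positions. The names build_prefill_sequence and distillation_loss, the slot representation, and the MSE choice are all hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def build_prefill_sequence(
    audio: List[Vector], prompt: List[Vector], think_slots: List[Vector]
) -> Tuple[List[Vector], List[int]]:
    """Concatenate audio, prompt, and K thinking slots into one input sequence.

    Because the slots are part of the input (assumption), they can be handled
    in the same forward pass as the prompt instead of being decoded one by one.
    Returns the sequence and the indices of the thinking positions.
    """
    sequence = audio + prompt + think_slots
    start = len(audio) + len(prompt)
    think_positions = list(range(start, start + len(think_slots)))
    return sequence, think_positions


def distillation_loss(
    hidden: List[Vector], think_positions: List[int], expert_targets: List[Vector]
) -> float:
    """Mean squared error between hidden states at the thinking positions and
    expert targets (MSE is an assumed stand-in for the paper's objective)."""
    total, count = 0.0, 0
    for pos, target in zip(think_positions, expert_targets):
        for h, t in zip(hidden[pos], target):
            total += (h - t) ** 2
            count += 1
    return total / count


# Toy usage: 2 audio frames, 1 prompt token, K = 2 thinking slots.
audio = [[0.1, 0.2], [0.3, 0.4]]
prompt = [[0.5, 0.6]]
slots = [[0.0, 0.0], [0.0, 0.0]]
sequence, think_positions = build_prefill_sequence(audio, prompt, slots)

hidden = sequence  # stand-in for hidden states after one forward pass
expert_targets = [[1.0, 0.0], [0.0, 1.0]]  # made-up expert embeddings
loss = distillation_loss(hidden, think_positions, expert_targets)
```

In this toy setup the auxiliary loss touches only the thinking positions, and the response would still be decoded as usual after the prefill, which is one way to read the claim of no additional autoregressive decoding cost.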
