Fugu-MT Paper Translation (Abstract): Towards Audio Token Compression in Large Audio Language Models

Paper overview: Towards Audio Token Compression in Large Audio Language Models

arxiv url: http://arxiv.org/abs/2511.20973v1
Date: Wed, 26 Nov 2025 02:00:38 GMT
Status: Translation complete
Last updated in system: 2025-11-27 18:37:58.919372
Title: Towards Audio Token Compression in Large Audio Language Models
Title (reference translation): Towards Audio Token Compression in Large Audio Language Models
Authors: Saurabhchand Bhati, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass
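Abstract summary: Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. This paper explores techniques for reducing the number of audio tokens generated by the LALM's audio encoder before they are consumed by the LLM decoder.
Reference score (independently computed attention score): 26.379508239446935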
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. These challenges make it difficult to extend LALMs to long-form audio and to deploy them on resource-constrained platforms such as edge devices. In this paper, we explore techniques such as unsupervised segmentation and uniform average pooling to reduce the number of audio tokens generated by the LALM's audio encoder before they are consumed by the LLM decoder. To mitigate potential performance degradation introduced by the compressed representations, we employ low-rank adapters to finetune the model. We evaluate our proposed models on two tasks, automatic speech recognition and speech-to-speech translation, both of which depend on effectively uncovering the underlying lexical content of the input signal, and we study the effect of downsampling on these tasks. Experimental results show that compressed LALMs can achieve performance close to that of frame-level LALMs while reducing the input audio token count by up to a factor of three before the LLM backbone.
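Abstract (reference translation): Large Audio Language Models (LALMs) show impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. These challenges make it difficult to extend LALMs to long-form audio and to deploy them on resource-constrained platforms such as edge devices. This paper examines techniques such as unsupervised segmentation and uniform average pooling to reduce the number of audio tokens generated by the LALM's audio encoder before they are consumed by the LLM decoder. To mitigate the potential performance degradation caused by the compressed representations, the model is fine-tuned with low-rank adapters. The proposed models are evaluated on two tasks that depend on effectively uncovering the lexical content of the input signal, automatic speech recognition and speech-to-speech translation, and the effect of downsampling on these tasks is studied. Experimental results show that compressed LALMs can achieve performance close to that of frame-level LALMs while reducing the number of input audio tokens by up to a factor of three before the LLM backbone.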
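
Uniform average pooling, one of the compression techniques named in the abstract, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `average_pool_tokens` and `attention_pairs`, the pooling factor of 3, the toy 2-dimensional frames, and the choice to average a trailing partial group over the frames it contains are all assumptions made for illustration. The second helper only counts query-key pairs among the audio tokens themselves and ignores any text tokens in the LLM input.

```python
from typing import List, Sequence


def average_pool_tokens(
    frames: Sequence[Sequence[float]], factor: int
) -> List[List[float]]:
    """Average each run of `factor` consecutive frame embeddings into one token.

    Assumption: a trailing group with fewer than `factor` frames is averaged
    over the frames it actually contains (not dropped, not zero-padded).
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    pooled: List[List[float]] = []
    for start in range(0, len(frames), factor):
        group = frames[start:start + factor]
        dim = len(group[0])
        # Element-wise mean over the frames in this group.
        pooled.append(
            [sum(frame[d] for frame in group) / len(group) for d in range(dim)]
        )
    return pooled


def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs in full self-attention over `num_tokens` tokens."""
    return num_tokens * num_tokens


# Toy encoder output: 7 frames of dimension 2 (made-up values).
frames = [[float(i), float(2 * i)] for i in range(7)]
pooled = average_pool_tokens(frames, factor=3)

print(len(frames), "frames ->", len(pooled), "tokens")
print("attention pairs:", attention_pairs(len(frames)), "->", attention_pairs(len(pooled)))
```

Because attention cost grows quadratically with sequence length, cutting the audio token count by a factor of three reduces the number of audio-to-audio attention pairs by roughly a factor of nine in this simplified count.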

Related papers