Towards Audio Token Compression in Large Audio Language Models
- URL: http://arxiv.org/abs/2511.20973v1
- Date: Wed, 26 Nov 2025 02:00:38 GMT
- Title: Towards Audio Token Compression in Large Audio Language Models
- Authors: Saurabhchand Bhati, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass
- Abstract summary: Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. This paper explores techniques to reduce the number of audio tokens produced by the LALM's audio encoder before they are consumed by the LLM decoder.
- Score: 26.379508239446935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. These challenges make it difficult to extend LALMs to long-form audio and to deploy them on resource-constrained platforms such as edge devices. In this paper, we explore techniques such as unsupervised segmentation and uniform average pooling to reduce the number of audio tokens produced by the LALM's audio encoder before they are consumed by the LLM decoder. To mitigate potential performance degradation introduced by the compressed representations, we employ low-rank adapters to finetune the model. We evaluate our proposed models on two tasks, automatic speech recognition and speech-to-speech translation, both of which depend on effectively uncovering the underlying lexical content of the input signal, and study the effect of downsampling on these tasks. Experimental results show that compressed LALMs can achieve performance close to that of frame-level LALMs while reducing the input audio token count by up to three times before the LLM backbone.
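As a rough illustration of the uniform average pooling mentioned in the abstract, the sketch below averages every group of `factor` consecutive encoder frames into one token. This is not the authors' implementation: the function name `average_pool`, the toy 2-dimensional frames, and the handling of a trailing partial group (averaged over its own length) are all assumptions made for the example.

```python
from typing import List

Frame = List[float]  # one encoder output frame (embedding vector)


def average_pool(frames: List[Frame], factor: int) -> List[Frame]:
    """Average each run of `factor` consecutive frames into a single frame.

    Assumption: a trailing group shorter than `factor` is averaged over
    its own length rather than dropped or padded.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    pooled: List[Frame] = []
    for start in range(0, len(frames), factor):
        group = frames[start:start + factor]
        dim = len(group[0])
        # Element-wise mean across the frames in this group.
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


# Toy encoder output: 7 frames of 2-dimensional embeddings (made-up values).
frames = [[float(t), float(2 * t)] for t in range(7)]
pooled = average_pool(frames, factor=3)
print(len(frames), "->", len(pooled))  # prints "7 -> 3"
```

Because self-attention cost grows quadratically with sequence length, shrinking the audio token count by a factor of three cuts the number of audio-to-audio attention pairs by roughly a factor of nine, which is the motivation the abstract gives for compressing before the LLM decoder.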
Related papers
- LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence [35.123477091633866]
LAMB is an audio captioning framework that bridges the modality gap between audio embeddings and the text embedding space. A Cross-Modal Aligner minimizes Cauchy-Schwarz divergence while maximizing mutual information. A Two-Stream Adapter that extracts semantically enriched audio embeddings delivers richer information to the Cross-Modal Aligner.
arXiv Detail & Related papers (2026-01-08T07:05:35Z)
- AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs [53.248502396225724]
AudioMarathon is a benchmark designed to evaluate both understanding and inference efficiency on long-form audio. We evaluate state-of-the-art LALMs and observe clear performance drops as audio length grows. The results show large gaps across current LALMs and highlight the need for better temporal reasoning.
arXiv Detail & Related papers (2025-10-08T17:50:16Z)
- PAL: Probing Audio Encoders via LLMs - Audio Information Transfer into LLMs [29.049167884343998]
Integration of audio perception into large language models (LLMs) is an emerging research area for enabling machine listening applications. We propose an efficient alternative, Lightweight Audio LLM Integration (LAL). LAL encodes rich audio semantics at an appropriate level of abstraction for integration into different blocks of LLMs.
arXiv Detail & Related papers (2025-06-12T07:23:07Z)
- Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model [85.72664004969182]
We introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction. Our post-training approach employs interleaved token-output of text and audio to enhance semantic coherence.
arXiv Detail & Related papers (2025-06-10T16:37:39Z)
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
- Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples [55.2480439325792]
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. These models often hallucinate non-existent sound events, reducing their reliability in real-world applications. We propose LISTEN, a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds.
arXiv Detail & Related papers (2025-05-20T15:44:01Z)
- Make Some Noise: Towards LLM audio reasoning and generation using sound tokens [19.48089933713418]
We introduce a novel approach that combines Variational Quantization with Flow Matching to convert audio into ultra-low-bitrate discrete tokens at 0.23 kbps. Our tokenizer outperforms a traditional VQ-VAE across various datasets with diverse acoustic events.
arXiv Detail & Related papers (2025-03-28T09:43:47Z)
- Large Language Models are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with WERs of 0.79% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders [36.528216873338614]
We propose to incorporate mixtures of 'weak' encoders into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
arXiv Detail & Related papers (2024-09-10T16:46:18Z)
- LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT [65.69648099999439]
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks.
We propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation.
arXiv Detail & Related papers (2023-10-07T03:17:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.