Related papers: PAL: Probing Audio Encoders via LLMs - Audio Information Transfer into LLMs

PAL: Probing Audio Encoders via LLMs - Audio Information Transfer into LLMs

URL: http://arxiv.org/abs/2506.10423v2
Date: Tue, 14 Oct 2025 20:14:40 GMT
Title: PAL: Probing Audio Encoders via LLMs - Audio Information Transfer into LLMs
Authors: Tony Alex, Wish Suharitdamrong, Sara Atito, Armin Mustafa, Philip J. B. Jackson, Imran Razzak, Muhammad Awais,
Abstract summary: Integration of audio perception into large language models (LLMs) is an emerging research area for enabling machine listening applications.<n>We propose an efficient alternative, Lightweight Audio LLM Integration (LAL)<n>LAL encodes rich audio semantics at an appropriate level of abstraction for integration into different blocks of LLMs.
Score: 29.049167884343998
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Integration of audio perception into large language models (LLMs) is an emerging research area for enabling machine listening applications, yet efficient transfer of rich audio semantics from audio encoders to LLMs remains underexplored. The most widely used integration paradigm projects the audio encoder output tokens into the LLM input space (e.g., via an MLP or a Q-Former), then prepends or inserts them to the text tokens. We refer to this generic scheme as Prepend to the LLM's input token space (PLITS) integration. We propose an efficient alternative, Lightweight Audio LLM Integration (LAL). LAL introduces audio representations solely via the attention mechanism within different layers of the LLM, bypassing its feedforward module. LAL encodes rich audio semantics at an appropriate level of abstraction for integration into different blocks of LLMs. Our design significantly reduces computational overhead compared to existing integration approaches. Observing with Whisper that the speech encoder benefits from PLITS integration, we propose an audio encoder aware approach for efficiently Probing Audio encoders via LLM (PAL), which employs PLITS integration for Whisper and LAL for general audio encoders. Under an identical training curriculum, LAL consistently maintains performance or outperforms existing integration approaches across multiple base LLMs and tasks. For general audio tasks, LAL improvement is up to 30% over a strong PLITS baseline while reducing memory usage by up to 64.1% and increasing throughput by up to 247.5%. Furthermore, for general audio-music-speech LLM, PAL performs on par with a fully PLITS integration-based system but with substantially improved computational and memory efficiency. Project page: https://ta012.github.io/PAL/

Related papers

SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing [77.87631792556942]
SLAM-LLM is an open-source framework designed to train customized Multimodal Large Language Models (MLLMs)<n>It provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins.<n>It includes high-performance checkpoints like Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC)
arXiv Detail & Related papers (2026-01-14T11:25:36Z)
LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence [35.123477091633866]
LAMB is an audio captioning framework that bridges the modality gap between audio embeddings and the text embedding space.<n>A Cross-Modal Aligner minimizes Cauchy-Schwarz divergence while maximizing mutual information.<n>A Two-Stream Adapter that extracts semantically enriched audio embeddings delivers richer information to the Cross-Modal Aligner.
arXiv Detail & Related papers (2026-01-08T07:05:35Z)
Towards Audio Token Compression in Large Audio Language Models [26.379508239446935]
Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks.<n>However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals.<n>This paper explores techniques to reduce the number of audio tokens generated by the LALM's audio encoder but before they are consumed by the LLM decoder.
arXiv Detail & Related papers (2025-11-26T02:00:38Z)
DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment [94.0709779805955]
We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM)<n>It is designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning.<n>DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks.
arXiv Detail & Related papers (2025-07-03T16:28:25Z)
Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model [85.72664004969182]
We introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks.<n>The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction.<n>Our post-training approach employs interleaved token-output of text and audio to enhance semantic coherence.
arXiv Detail & Related papers (2025-06-10T16:37:39Z)
From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
We introduce LISTEN, a contrastive-like training method designed to improve ALLMs' ability to distinguish between present and absent sounds.<n>We also extend BALSa to multi-audio scenarios, where the model either explains the differences between audio inputs or produces a unified caption.<n> Experimental results indicate that our method effectively mitigates audio hallucinations while reliably maintaining strong performance in audio understanding, reasoning, and instruction-following skills.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples [55.2480439325792]
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs.<n>These models often hallucinate non-existent sound events, reducing their reliability in real-world applications.<n>We propose LISTEN, a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds.
arXiv Detail & Related papers (2025-05-20T15:44:01Z)
Probing Audio-Generation Capabilities of Text-Based Language Models [5.4211188445379825]
This research investigates the extent to which Large Language Models can be prompted to generate audio.<n>We employ a three-tier approach, progressively increasing the complexity of audio generation.<n>Our findings reveal that while LLMs can generate basic audio features, their performance deteriorates as the complexity of the audio increases.
arXiv Detail & Related papers (2025-05-04T23:46:01Z)
Make Some Noise: Towards LLM audio reasoning and generation using sound tokens [19.48089933713418]
We introduce a novel approach that combines Variational Quantization with Flow Matching to convert audio into ultra-low discrete tokens of 0.23kpbs.<n>Our tokenizer outperforms a traditional VQ-VAE across various datasets with diverse acoustic events.
arXiv Detail & Related papers (2025-03-28T09:43:47Z)
Large Language Models are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.<n>We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.<n>We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.79% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders [36.528216873338614]
We propose to incorporate mixtures of weak' encoders into the AudioLLM framework.<n>MoWE supplements a base encoder with a pool of relatively light weight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size.<n>Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
arXiv Detail & Related papers (2024-09-10T16:46:18Z)
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time [73.7845280328535]
We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio. Meerkat can tackle challenging tasks such as audio referred image grounding, image guided audio temporal localization, and audio-visual fact-checking. We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.
arXiv Detail & Related papers (2024-07-01T23:32:25Z)
C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
We introduce C3LLM, a novel framework combining three tasks of video-to-audio, audio-to-text, and text-to-audio together. C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities. Our method combines the previous tasks of audio understanding, video-to-audio generation, and text-to-audio generation together into one unified model.
arXiv Detail & Related papers (2024-05-25T09:10:12Z)
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity [56.30595787061546]
We focus on solving one of the most important tasks in the field of speech processing, with speech foundation encoders and large language models (LLM) Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM. We found that delicate designs are not necessary, while an embarrassingly simple composition of off-the-shelf speech encoder, LLM, and the only trainable linear projector is competent for the ASR task.
arXiv Detail & Related papers (2024-02-13T23:25:04Z)
Audio-Visual LLM for Video Understanding [25.963166809113005]
This paper presents Audio-Visual LLM, a Multimodal Large Language Model that takes both visual and auditory inputs for holistic video understanding. We introduce a high-quality video instruction dataset, derived from GPT-4. Experiments demonstrate that Audio-Visual LLM impressively achieves strong zero-shot results across a range of video understanding tasks.
arXiv Detail & Related papers (2023-12-11T02:50:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.