Related papers: LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence

URL: http://arxiv.org/abs/2601.04658v1
Date: Thu, 08 Jan 2026 07:05:35 GMT
Title: LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence
Authors: Hyeongkeun Lee, Jongmin Choi, KiHyun Nam, Joon Son Chung
Abstract summary: LAMB is an audio captioning framework that bridges the modality gap between audio embeddings and the text embedding space. A Cross-Modal Aligner minimizes Cauchy-Schwarz divergence while maximizing mutual information. A Two-Stream Adapter that extracts semantically enriched audio embeddings delivers richer information to the Cross-Modal Aligner.
Score: 35.123477091633866
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Automated Audio Captioning aims to describe the semantic content of input audio. Recent works have employed large language models (LLMs) as a text decoder to leverage their reasoning capabilities. However, prior approaches that project audio features into the LLM embedding space without considering cross-modal alignment fail to fully utilize these capabilities. To address this, we propose LAMB, an LLM-based audio captioning framework that bridges the modality gap between audio embeddings and the LLM text embedding space. LAMB incorporates a Cross-Modal Aligner that minimizes Cauchy-Schwarz divergence while maximizing mutual information, yielding tighter alignment between audio and text at both global and token levels. We further design a Two-Stream Adapter that extracts semantically enriched audio embeddings, thereby delivering richer information to the Cross-Modal Aligner. Finally, leveraging the aligned audio embeddings, a proposed Token Guide directly computes scores within the LLM text embedding space to steer the output logits of generated captions. Experimental results confirm that our framework strengthens the reasoning capabilities of the LLM decoder, achieving state-of-the-art performance on AudioCaps.
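The abstract names Cauchy-Schwarz divergence as the alignment objective but does not state how it is computed. For orientation, the following is a minimal sketch of a standard kernel-based empirical estimator, D_CS = log mean(K_xx) + log mean(K_yy) - 2 log mean(K_xy). It is not the authors' implementation: the Gaussian kernel, the bandwidth `sigma`, the function names, and the synthetic "audio" and "text" embeddings are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel(a, b, sigma=1.0):
    """Pairwise Gaussian kernel between rows of a (N, D) and b (M, D)."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def cs_divergence(x, y, sigma=1.0):
    """Empirical Cauchy-Schwarz divergence between two sample sets.

    Zero when both sets are identical; larger when the sets overlap less.
    """
    kxx = gaussian_kernel(x, x, sigma).mean()
    kyy = gaussian_kernel(y, y, sigma).mean()
    kxy = gaussian_kernel(x, y, sigma).mean()
    return float(np.log(kxx) + np.log(kyy) - 2.0 * np.log(kxy))


# Hypothetical stand-ins for audio and text embeddings (not real data).
rng = np.random.default_rng(0)
audio = rng.normal(0.0, 1.0, size=(64, 8))
text_near = audio + 0.05 * rng.normal(size=(64, 8))  # nearly aligned
text_far = rng.normal(3.0, 1.0, size=(64, 8))        # shifted distribution

d_near = cs_divergence(audio, text_near)
d_far = cs_divergence(audio, text_far)
print(f"near: {d_near:.4f}  far: {d_far:.4f}")
```

In this sketch the divergence is small for the nearly aligned pair and larger for the shifted pair, which is the property an alignment loss of this kind relies on. How LAMB combines it with the mutual-information term is described in the paper itself, not here.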

Related papers

Towards Audio Token Compression in Large Audio Language Models [26.379508239446935]
Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. This paper explores techniques to reduce the number of audio tokens after they are generated by the LALM's audio encoder but before they are consumed by the LLM decoder.
arXiv Detail & Related papers (2025-11-26T02:00:38Z)
VOX-KRIKRI: Unifying Speech and Language through Continuous Fusion [7.96619533548369]
We present a framework that bridges pre-trained decoder-based large language models (LLMs) and acoustic encoder-decoder architectures such as Whisper. Our method operates fully in continuous text representation spaces, fusing Whisper's hidden decoder states with those of an LLM through cross-modal attention.
arXiv Detail & Related papers (2025-09-19T06:42:42Z)
PAL: Probing Audio Encoders via LLMs - Audio Information Transfer into LLMs [29.049167884343998]
Integration of audio perception into large language models (LLMs) is an emerging research area for enabling machine listening applications. We propose an efficient alternative, Lightweight Audio LLM Integration (LAL). LAL encodes rich audio semantics at an appropriate level of abstraction for integration into different blocks of LLMs.
arXiv Detail & Related papers (2025-06-12T07:23:07Z)
Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model [85.72664004969182]
We introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction. Our post-training approach employs interleaved token-output of text and audio to enhance semantic coherence.
arXiv Detail & Related papers (2025-06-10T16:37:39Z)
From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
Probing Audio-Generation Capabilities of Text-Based Language Models [5.4211188445379825]
This research investigates the extent to which Large Language Models can be prompted to generate audio. We employ a three-tier approach, progressively increasing the complexity of audio generation. Our findings reveal that while LLMs can generate basic audio features, their performance deteriorates as the complexity of the audio increases.
arXiv Detail & Related papers (2025-05-04T23:46:01Z)
FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing [81.3306413498174]
Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in both temporal and emotional aspects. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. We propose a large language model (LLM) based flow matching architecture for dubbing, named FlowDubber.
arXiv Detail & Related papers (2025-05-02T13:30:19Z)
Large Language Models are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.79% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
We introduce C3LLM, a novel framework that combines three tasks: video-to-audio, audio-to-text, and text-to-audio. C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities. Our method unifies the previously separate tasks of audio understanding, video-to-audio generation, and text-to-audio generation in a single model.
arXiv Detail & Related papers (2024-05-25T09:10:12Z)
LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT [65.69648099999439]
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks. We propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation.
arXiv Detail & Related papers (2023-10-07T03:17:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.