Extending Audio Context for Long-Form Understanding in Large Audio-Language Models
- URL: http://arxiv.org/abs/2510.15231v1
- Date: Fri, 17 Oct 2025 01:44:28 GMT
- Title: Extending Audio Context for Long-Form Understanding in Large Audio-Language Models
- Authors: Yuatyong Chaichana, Pittawat Taveekitworachai, Warit Sirichotedumrong, Potsawee Manakul, Kunat Pipatanakul
- Abstract summary: Partial YaRN is a training-free, audio-only context extension method for large audio-language models (LALMs). VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training. Our experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across a wide range of settings.
- Score: 13.333718377388713
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g. YaRN) on unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, audio-only extension method that modifies only audio token positions, leaving text positions intact to preserve the base LLM's text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training and improving robustness for long-context audio understanding. Our experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across a wide range of settings, and the VLAT training strategy provides substantial improvement, achieving strong performance on long audio of unseen lengths.
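The core idea of Partial YaRN, applying YaRN-style RoPE frequency interpolation only to audio token positions while leaving text positions untouched, can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the helper names (`yarn_inv_freq`, `partial_yarn_angles`) and the specific hyperparameters (`scale`, `beta_fast`, `beta_slow`, `orig_max_pos`) are assumptions chosen for clarity, following the general YaRN recipe of interpolating low-frequency RoPE dimensions and leaving high-frequency ones extrapolated.

```python
import numpy as np

def yarn_inv_freq(dim, base=10000.0, scale=4.0,
                  beta_fast=32, beta_slow=1, orig_max_pos=2048):
    """YaRN-style interpolated RoPE inverse frequencies (illustrative)."""
    # Standard RoPE inverse frequencies, one per dimension pair.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    # Wavelength of each dimension pair.
    wavelen = 2 * np.pi / inv_freq
    # Ramp from 0 (short wavelengths: keep original frequency) to
    # 1 (long wavelengths: interpolate by `scale`).
    low, high = orig_max_pos / beta_fast, orig_max_pos / beta_slow
    ramp = np.clip((wavelen - low) / (high - low), 0.0, 1.0)
    return inv_freq * (1 - ramp) + (inv_freq / scale) * ramp

def partial_yarn_angles(position_ids, is_audio, dim, scale=4.0):
    """RoPE rotation angles: YaRN-scaled frequencies for audio tokens,
    unmodified frequencies for text tokens (preserving text behavior)."""
    base_freq = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    audio_freq = yarn_inv_freq(dim, scale=scale)
    pos = np.asarray(position_ids, dtype=np.float64)[:, None]
    freqs = np.where(np.asarray(is_audio)[:, None],
                     audio_freq[None, :], base_freq[None, :])
    # Shape (seq_len, dim // 2); fed into sin/cos to rotate Q/K vectors.
    return pos * freqs
```

Under this sketch, text tokens receive exactly the rotations the base LLM was trained with, while audio tokens are effectively "compressed" into the trained positional range, which is what allows longer audio inputs without disturbing the backbone's text capabilities.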
Related papers
- Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning [39.264735719707154]
Current efforts replicate text-based reasoning by contextualizing audio content through a one-time encoding. We propose audio-interleaved reasoning to break through this bottleneck. We present Echo, a LALM capable of dynamically re-listening to audio on demand during reasoning.
arXiv Detail & Related papers (2026-02-12T13:06:34Z) - Beyond Transcripts: A Renewed Perspective on Audio Chaptering [66.61445564139052]
We show that a novel audio-only architecture (AudioSeg) outperforms text-based approaches for segmenting long-form audio into coherent sections. Our experiments on YTSeg reveal that AudioSeg substantially outperforms text-based approaches, pauses provide the largest acoustic gains, and MLLMs remain limited by context length and weak instruction following.
arXiv Detail & Related papers (2026-02-09T18:28:10Z) - SLAP: Scalable Language-Audio Pretraining with Variable-Duration Audio and Multi-Objective Training [31.192251626550203]
We introduce Scalable Language-Audio Pretraining (SLAP), which scales language-audio pretraining to 109 million audio-text pairs with variable audio durations. SLAP unifies contrastive loss with additional self-supervised and captioning losses in a single-stage training, facilitating the learning of richer dense audio representations.
arXiv Detail & Related papers (2026-01-18T21:36:19Z) - FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing [48.84039953531356]
FastLongSpeech is designed to extend LSLM capabilities for efficient long-speech processing. It incorporates an iterative fusion strategy that can compress excessively long-speech sequences into manageable lengths. Our method exhibits strong performance in both long-speech and short-speech tasks, while greatly improving inference efficiency.
arXiv Detail & Related papers (2025-07-20T04:11:06Z) - DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment [94.0709779805955]
We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM). It is designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks.
arXiv Detail & Related papers (2025-07-03T16:28:25Z) - From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z) - BLAB: Brutally Long Audio Bench [90.20616799311578]
Brutally Long Audio Bench (BLAB) is a long-form audio benchmark that evaluates audio LMs on localization, duration estimation, emotion, and counting tasks. BLAB consists of 833+ hours of diverse, full-length audio clips, each paired with human-annotated, text-based natural language questions and answers. We evaluate six open-source and proprietary audio LMs on BLAB and find that all of them, including advanced models such as Gemini 2.0 Pro and GPT-4o, struggle with the tasks.
arXiv Detail & Related papers (2025-05-05T22:28:53Z) - Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs. In our method, we utilize a pre-trained TTA diffusion network as the audio generation agent to work in tandem with GPT-4. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z) - Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models [0.9285295512807729]
The Audio Question Answering (AQA) task includes audio event classification, audio captioning, and open-ended reasoning. LALMs excel in general audio understanding, but are limited in temporal reasoning. This paper addresses these challenges and limitations in audio temporal reasoning.
arXiv Detail & Related papers (2024-09-10T05:26:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.