LongAudio-RAG: Event-Grounded Question Answering over Multi-Hour Long Audio
- URL: http://arxiv.org/abs/2602.14612v2
- Date: Fri, 20 Feb 2026 09:24:27 GMT
- Title: LongAudio-RAG: Event-Grounded Question Answering over Multi-Hour Long Audio
- Authors: Naveen Vakada, Kartik Hegde, Arvind Krishna Sridhar, Yinyi Guo, Erik Visser
- Abstract summary: LongAudio-RAG (LA-RAG) is a framework that grounds Large Language Model (LLM) outputs in retrieved, timestamped acoustic event detections. We demonstrate the practicality of our approach by deploying it in a hybrid edge-cloud environment. Experiments show that structured, event-level retrieval significantly improves accuracy compared to vanilla Retrieval-Augmented Generation.
- Score: 6.935416517354558
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models show promise, but long-audio question answering remains difficult due to context-length limits. We introduce LongAudio-RAG (LA-RAG), a hybrid framework that grounds Large Language Model (LLM) outputs in retrieved, timestamped acoustic event detections rather than raw audio. Multi-hour streams are converted into structured event records stored in an SQL database, and at inference time the system resolves natural-language time references, classifies intent, retrieves only the relevant events, and generates answers using this constrained evidence. To evaluate performance, we construct a synthetic long-audio benchmark by concatenating recordings with preserved timestamps and generating template-based question-answer pairs for detection, counting, and summarization tasks. Finally, we demonstrate the practicality of our approach by deploying it in a hybrid edge-cloud environment, where the audio grounding model runs on-device on IoT-class hardware while the LLM is hosted on a GPU-backed server. This architecture enables low-latency event extraction at the edge and high-quality language reasoning in the cloud. Experiments show that structured, event-level retrieval significantly improves accuracy compared to vanilla Retrieval-Augmented Generation (RAG) or text-to-SQL approaches.
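The retrieval core of this pipeline is concrete enough to sketch. Below is a minimal illustration over SQLite; the table layout, field names, and the retrieve_events helper are assumptions for illustration, since the abstract specifies only that timestamped event records live in an SQL database.

```python
# Minimal sketch of LA-RAG-style event retrieval over an SQLite store.
# The schema and helper below are hypothetical: the abstract specifies an
# SQL database of timestamped event records, not these exact fields.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE events (
           id         INTEGER PRIMARY KEY,
           label      TEXT NOT NULL,   -- detected acoustic event class
           start_sec  REAL NOT NULL,   -- onset within the recording
           end_sec    REAL NOT NULL,   -- offset within the recording
           confidence REAL NOT NULL    -- detector confidence score
       )"""
)
conn.executemany(
    "INSERT INTO events (label, start_sec, end_sec, confidence) VALUES (?, ?, ?, ?)",
    [("dog_bark", 3605.2, 3607.9, 0.91), ("siren", 7230.0, 7241.5, 0.88)],
)

def retrieve_events(label, t_start, t_end, min_conf=0.5):
    """Return timestamped detections overlapping [t_start, t_end].

    Only these constrained rows, never raw audio, are handed to the
    LLM as evidence, which is what keeps hallucination in check.
    """
    return conn.execute(
        "SELECT label, start_sec, end_sec, confidence FROM events "
        "WHERE label = ? AND end_sec >= ? AND start_sec <= ? AND confidence >= ? "
        "ORDER BY start_sec",
        (label, t_start, t_end, min_conf),
    ).fetchall()

# "Did a dog bark during the second hour?" -> resolve to [3600, 7200) seconds
print(retrieve_events("dog_bark", 3600.0, 7200.0))
```

Counting and summarization intents would reduce to COUNT(*) or GROUP BY variants of the same constrained query.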
Related papers
- Frame-Level Internal Tool Use for Temporal Grounding in Audio LMs [48.50855715191533]
Large audio language models are increasingly used for complex audio understanding tasks, yet they struggle with tasks that require precise temporal grounding, such as word alignment and speaker diarization. We propose frame-level internal tool use, a method that trains audio LMs to use their own internal audio representations to perform temporal grounding directly.
arXiv Detail & Related papers (2026-02-10T19:19:52Z)
- AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs [53.248502396225724]
AudioMarathon is a benchmark designed to evaluate both understanding and inference efficiency on long-form audio. We evaluate state-of-the-art LALMs and observe clear performance drops as audio length grows. The results show large gaps across current LALMs and highlight the need for better temporal reasoning.
arXiv Detail & Related papers (2025-10-08T17:50:16Z)
- Beamforming-LLM: What, Where and When Did I Miss? [0.6655749439594806]
We present Beamforming-LLM, a system that enables users to semantically recall conversations they may have missed in multi-speaker environments. The system combines spatial audio capture using a microphone array with retrieval-augmented generation (RAG) to support natural language queries (a sketch of the spatial-capture step follows below).
arXiv Detail & Related papers (2025-09-07T21:52:26Z)
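The summary above leaves the beamformer unspecified; as a hedged illustration of the spatial-capture step, a basic delay-and-sum beamformer over a linear microphone array (all geometry and parameter names assumed here) looks like this:

```python
# Hypothetical delay-and-sum beamformer for a linear mic array; the actual
# Beamforming-LLM pipeline is not detailed in the summary above.
import numpy as np

def delay_and_sum(signals: np.ndarray, mic_x: np.ndarray,
                  angle_rad: float, fs: int, c: float = 343.0) -> np.ndarray:
    """Steer toward `angle_rad` (from the array axis) and average channels.

    signals: (n_mics, n_samples) time-aligned recordings
    mic_x:   (n_mics,) mic positions along the array axis, in meters
    """
    delays = mic_x * np.cos(angle_rad) / c     # far-field arrival delays (s)
    shifts = np.round(delays * fs).astype(int)
    shifts -= shifts.min()                     # make all shifts non-negative
    n = signals.shape[1]
    out = np.zeros(n)
    for sig, s in zip(signals, shifts):
        out[: n - s] += sig[s:]                # advance each channel, then sum
    return out / len(signals)
```

The steered output would then be transcribed and indexed for RAG, echoing LA-RAG's pattern of retrieving structured evidence rather than raw audio.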
- Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries [18.147981850263708]
We propose a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors from text or audio prompts (see the sketch below). DASM effectively balances localization accuracy with generalization to novel classes, outperforming CLAP-based methods in the open-vocabulary setting.
arXiv Detail & Related papers (2025-07-22T08:24:01Z)
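Frame-level retrieval of this kind amounts to scoring every audio frame against a query embedding. The following cosine-similarity sketch illustrates the matching step only; the embedding shapes and threshold are assumptions, not DASM's actual scoring head.

```python
# Hypothetical frame-vs-query matching for open-vocabulary sound event
# detection; DASM's real scoring head is more involved than this.
import numpy as np

def frame_level_detection(frame_emb: np.ndarray, query_emb: np.ndarray,
                          threshold: float = 0.5) -> np.ndarray:
    """frame_emb: (T, D) per-frame audio embeddings;
    query_emb: (D,) embedding of a text or audio prompt.
    Returns a boolean (T,) mask marking frames where the queried event is active.
    """
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    return (f @ q) >= threshold    # cosine similarity per frame, thresholded
```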
- VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model [84.25283710008785]
VITA-Audio is an end-to-end large speech model with fast audio-text token generation. Its MCTP module efficiently generates multiple audio tokens within a single model forward pass (a speculative sketch follows below). A four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality.
arXiv Detail & Related papers (2025-05-06T17:59:53Z)
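The summary does not reveal MCTP's internals. If it resembles generic multi-token prediction with parallel output heads, which is purely an assumption here, a toy version would be:

```python
# Speculative illustration only: K parallel heads emitting K future audio
# tokens from one hidden state; VITA-Audio's actual MCTP module may differ.
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, k: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(k))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, d_model) -> (batch, k, vocab_size) logits,
        # one slice per future token, all from a single forward pass
        return torch.stack([head(h) for head in self.heads], dim=1)
```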
- Long-Form Speech Generation with Spoken Language Models [64.29591880693468]
Textless spoken language models struggle to generate plausible speech past tens of seconds. We derive SpeechSSM, the first speech language model family to learn from and sample long-form spoken audio. SpeechSSMs leverage recent advances in linear-time sequence modeling to greatly surpass current Transformer spoken LMs in coherence and efficiency.
arXiv Detail & Related papers (2024-12-24T18:56:46Z)
- OMCAT: Omni Context Aware Transformer [27.674943980306423]
OCTAV is a novel dataset designed to capture event transitions across audio and video.
OMCAT is a powerful model that leverages RoTE to enhance temporal grounding and computational efficiency in time-anchored tasks (a rough sketch of the idea appears below).
Our model demonstrates state-of-the-art performance on Audio-Visual Question Answering (AVQA) tasks and the OCTAV benchmark, showcasing significant gains in temporal reasoning and cross-modal alignment.
arXiv Detail & Related papers (2024-10-15T23:16:28Z)
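Assuming RoTE applies a RoPE-style rotation whose angles are driven by absolute timestamps rather than token indices (an assumption; the paper's exact formulation may differ), a minimal version is:

```python
# Hypothetical RoPE-style rotation driven by absolute time rather than
# token index; OMCAT's actual RoTE formulation may differ in detail.
import numpy as np

def rotary_time_embedding(x: np.ndarray, t_sec: np.ndarray,
                          base: float = 10000.0) -> np.ndarray:
    """x: (T, D) features with D even; t_sec: (T,) absolute timestamps."""
    d = x.shape[1] // 2
    inv_freq = base ** (-np.arange(d) / d)   # one frequency per feature pair
    ang = np.outer(t_sec, inv_freq)          # rotation angles from wall-clock time
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :d], x[:, d:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
```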
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs. In our method, we utilize a pre-trained text-to-audio (TTA) diffusion network as the audio generation agent to work in tandem with GPT-4. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
- Language-based Audio Moment Retrieval [14.227865973426843]
We propose and design a new task called audio moment retrieval (AMR). Unlike conventional language-based audio retrieval tasks, AMR aims to predict relevant moments in untrimmed long audio based on a text query (see the sketch below). We build a dedicated dataset, Clotho-Moment, consisting of large-scale simulated audio recordings with moment annotations. We then propose a DETR-based model, named Audio Moment DETR (AM-DETR), as a fundamental framework for AMR tasks.
arXiv Detail & Related papers (2024-09-24T02:24:48Z)
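A naive AMR baseline, distinct from AM-DETR itself, could threshold per-frame query scores like those sketched earlier and merge contiguous active frames into moments; the helper below is hypothetical:

```python
# Hypothetical post-processing for audio moment retrieval: merge runs of
# query-active frames into timestamped moments. AM-DETR instead predicts
# moments directly with a DETR-style decoder.
def frames_to_moments(mask, hop_sec):
    """mask: iterable of per-frame booleans; hop_sec: frame hop in seconds.
    Returns a list of (start_sec, end_sec) moments."""
    moments, start = [], None
    for i, active in enumerate(list(mask) + [False]):  # sentinel closes a final run
        if active and start is None:
            start = i
        elif not active and start is not None:
            moments.append((start * hop_sec, i * hop_sec))
            start = None
    return moments

# e.g. frames_to_moments([0, 1, 1, 0, 1], 0.5) -> [(0.5, 1.5), (2.0, 2.5)]
```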
- Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models [0.9285295512807729]
The Audio Question Answering (AQA) task includes audio event classification, audio captioning, and open-ended reasoning. LALMs excel in general audio understanding, but are limited in temporal reasoning. This paper addresses these challenges and limitations in audio temporal reasoning.
arXiv Detail & Related papers (2024-09-10T05:26:53Z)
- Leveraging Language Model Capabilities for Sound Event Detection [10.792576135806623]
We propose an end-to-end framework for understanding audio features while simultaneously generating sound events and their temporal locations.
Specifically, we employ pretrained acoustic models to capture discriminative features across different categories and language models for autoregressive text generation.
arXiv Detail & Related papers (2023-08-22T15:59:06Z)
- Separate Anything You Describe [53.30484933564858]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). AudioSep is a foundation model for open-domain audio source separation with natural language queries.
arXiv Detail & Related papers (2023-08-09T16:09:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.