Spatial Audio Question Answering and Reasoning on Dynamic Source Movements
- URL: http://arxiv.org/abs/2602.16334v1
- Date: Wed, 18 Feb 2026 10:16:30 GMT
- Title: Spatial Audio Question Answering and Reasoning on Dynamic Source Movements
- Authors: Arvind Krishna Sridhar, Yinyi Guo, Erik Visser
- Abstract summary: First, we introduce a movement-centric spatial audio augmentation framework that synthesizes diverse motion patterns from isolated mono audio events. Second, we propose an end-to-end multimodal finetuning approach with a thinking mode, which allows audio-language models to produce explicit intermediate reasoning steps. Third, we investigate the impact of query-conditioned source separation as a preprocessing stage and compare three inference regimes.
- Score: 8.029049649310211
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Spatial audio understanding aims to enable machines to interpret complex auditory scenes, particularly when sound sources move over time. In this work, we study Spatial Audio Question Answering (Spatial AQA) with a focus on movement reasoning, where a model must infer object motion, position, and directional changes directly from stereo audio. First, we introduce a movement-centric spatial audio augmentation framework that synthesizes diverse motion patterns from isolated mono audio events, enabling controlled and scalable training data generation. Second, we propose an end-to-end multimodal finetuning approach with a thinking mode, which allows audio-language models to produce explicit intermediate reasoning steps before predicting an answer. Third, we investigate the impact of query-conditioned source separation as a preprocessing stage and compare three inference regimes: no masking, an audio grounding model (AGM), and ground-truth masks. Our results show that reasoning amplifies the benefits of source separation, with thinking mode yielding a significant improvement of +5.1% when a single event is present in the question. These findings highlight the interplay between movement modeling, reasoning, and separation quality, offering new insights for advancing spatial audio understanding.
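The first contribution, movement-centric augmentation, can be illustrated with a small sketch. The snippet below is an illustrative assumption rather than the paper's actual pipeline: it renders an isolated mono event into stereo with a time-varying constant-power pan and an inverse-distance gain to simulate a moving source; the real framework would presumably use richer spatialization (e.g., HRTFs or room simulation).

```python
import numpy as np

def render_moving_source(mono, azimuth_start_deg, azimuth_end_deg,
                         dist_start_m=1.0, dist_end_m=1.0):
    """Hypothetical sketch: spatialize a mono event as a source moving
    from one azimuth/distance to another over the clip's duration.
    Azimuth is in degrees, -90 = hard left, +90 = hard right."""
    n = len(mono)
    t = np.linspace(0.0, 1.0, n)
    # Linear azimuth trajectory, mapped to a constant-power pan position in [0, 1]
    az_deg = azimuth_start_deg + t * (azimuth_end_deg - azimuth_start_deg)
    pan = (az_deg + 90.0) / 180.0
    left = np.cos(pan * np.pi / 2.0) * mono
    right = np.sin(pan * np.pi / 2.0) * mono
    # Inverse-distance attenuation along a linear distance trajectory
    dist = np.maximum(dist_start_m + t * (dist_end_m - dist_start_m), 0.1)
    gain = 1.0 / dist
    return np.stack([left * gain, right * gain], axis=-1)

# Example: a 2 s event sweeping left-to-right while approaching the listener.
sr = 16000
mono_event = 0.1 * np.random.randn(2 * sr)   # stand-in for an isolated mono event
stereo_clip = render_moving_source(mono_event, -60.0, 60.0,
                                   dist_start_m=3.0, dist_end_m=1.0)
```

Because the trajectory parameters are known at synthesis time, question-answer pairs about motion (e.g., direction of movement, approach vs. recession) can be generated automatically, which is what makes this kind of augmentation controlled and scalable.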
Related papers
- Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs [39.209987830131816]
Large Audio-Language Models (LALMs) have recently shown impressive progress in speech recognition, audio captioning, and auditory question answering. Yet, whether these models can perceive dynamics, particularly the motion of sound sources, remains unclear. We introduce AMPBench, the first benchmark explicitly designed to evaluate auditory motion understanding.
arXiv Detail & Related papers (2025-11-17T11:45:41Z) - UALM: Unified Audio Language Model for Understanding, Generation and Reasoning [124.19449187588832]
Unified Audio Language Model (UALM) aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. We first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks.
arXiv Detail & Related papers (2025-10-13T22:55:01Z) - Spatial Audio Motion Understanding and Reasoning [8.029049649310211]
Spatial audio reasoning enables machines to interpret auditory scenes by understanding events and their spatial attributes. First, we introduce a spatial audio encoder that processes spatial audio to detect multiple overlapping events and estimate their spatial attributes, Direction of Arrival (DoA) and source distance, at the frame level. Second, to answer complex queries about dynamic audio scenes involving moving sources, we condition a large language model (LLM) on structured spatial attributes extracted by our model (a conditioning sketch appears after this list).
arXiv Detail & Related papers (2025-09-18T06:53:22Z) - Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge [102.84031769492708]
This task defines three QA subsets to test audio-language models on interactive question-answering over diverse acoustic scenes. Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity.
arXiv Detail & Related papers (2025-05-12T09:04:16Z) - Audio-Plane: Audio Factorization Plane Gaussian Splatting for Real-Time Talking Head Synthesis [56.749927786910554]
We propose a novel framework that integrates Gaussian Splatting with a structured Audio Factorization Plane (Audio-Plane) to enable high-quality, audio-synchronized, and real-time talking head generation. Our method achieves state-of-the-art visual quality, precise audio-lip synchronization, and real-time performance, outperforming prior approaches across both 2D- and 3D-based paradigms.
arXiv Detail & Related papers (2025-03-28T16:50:27Z) - Unveiling and Mitigating Bias in Audio Visual Segmentation [9.427676046134374]
Community researchers have developed a range of advanced audio-visual segmentation models to improve the quality of sounding objects' masks.
While masks created by these models may initially appear plausible, they occasionally exhibit anomalies with incorrect grounding logic.
We attribute this to inherent real-world preferences and distributions providing a simpler learning signal than complex audio-visual grounding.
arXiv Detail & Related papers (2024-07-23T16:55:04Z) - A Survey on Audio Diffusion Models: Text To Speech Synthesis and
Enhancement in Generative AI [64.71397830291838]
Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction.
With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text to speech and speech enhancement.
This work conducts a survey of audio diffusion models that is complementary to existing surveys.
arXiv Detail & Related papers (2023-03-23T15:17:15Z) - Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z) - Move2Hear: Active Audio-Visual Source Separation [90.16327303008224]
We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest.
We introduce a reinforcement learning approach that trains movement policies controlling the agent's camera and microphone placement over time.
We demonstrate our model's ability to find minimal movement sequences with maximal payoff for audio source separation.
arXiv Detail & Related papers (2021-05-15T04:58:08Z) - Audio-visual Speech Separation with Adversarially Disentangled Visual
Representation [23.38624506211003]
Speech separation aims to separate individual voice from an audio mixture of multiple simultaneous talkers.
In our model, we use a face detector to detect the number of speakers in the scene and use visual information to avoid the permutation problem.
Our proposed model is shown to outperform the state-of-the-art audio-only model and three audio-visual models.
arXiv Detail & Related papers (2020-11-29T10:48:42Z)
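As a rough illustration of the LLM-conditioning step described in the "Spatial Audio Motion Understanding and Reasoning" entry above, the hypothetical snippet below serializes frame-level event, DoA, and distance estimates into a structured text block that could be prepended to a question before it is passed to an LLM. The field names, formatting, and function name are assumptions for illustration, not that paper's actual interface.

```python
def spatial_attributes_to_prompt(frames):
    """Serialize frame-level spatial attributes into a compact text block.

    frames: list of dicts such as
        {"t": 0.5, "event": "siren", "doa_deg": -60.0, "dist_m": 8.0}
    The schema is an illustrative assumption, not the paper's format.
    """
    lines = ["Detected sound events (time, class, direction of arrival, distance):"]
    for f in frames:
        lines.append(f"- t={f['t']:.1f}s  {f['event']}  "
                     f"DoA={f['doa_deg']:+.0f} deg  dist={f['dist_m']:.1f} m")
    return "\n".join(lines)

# A siren sweeping from the left toward the front-right while getting closer.
frames = [
    {"t": 0.0, "event": "siren", "doa_deg": -60.0, "dist_m": 8.0},
    {"t": 1.0, "event": "siren", "doa_deg": -10.0, "dist_m": 5.0},
    {"t": 2.0, "event": "siren", "doa_deg": 40.0, "dist_m": 3.0},
]
context = spatial_attributes_to_prompt(frames)
# The structured context plus a question such as "Is the siren approaching the
# listener?" would then be given to the LLM, which reasons over the attributes.
```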