Related papers: In-the-wild Audio Spatialization with Flexible Text-guided Localization

URL: http://arxiv.org/abs/2506.00927v1
Date: Sun, 01 Jun 2025 09:41:56 GMT
Title: In-the-wild Audio Spatialization with Flexible Text-guided Localization
Authors: Tianrui Pan, Jie Liu, Zewen Huang, Jie Tang, Gangshan Wu
Abstract summary: To enhance immersive experiences, binaural audio offers spatial awareness of sounding objects in AR, VR, and embodied AI applications. While existing audio spatialization methods can generally map any available monaural audio to binaural audio signals, they often lack the flexible and interactive control needed in complex multi-object user-interactive environments. We propose a Text-guided Audio Spatialization (TAS) framework that utilizes flexible text prompts and evaluates our model from unified generation and comprehension perspectives.
Score: 37.60344400859993
License: http://creativecommons.org/licenses/by/4.0/
Abstract: To enhance immersive experiences, binaural audio offers spatial awareness of sounding objects in AR, VR, and embodied AI applications. While existing audio spatialization methods can generally map any available monaural audio to binaural audio signals, they often lack the flexible and interactive control needed in complex multi-object user-interactive environments. To address this, we propose a Text-guided Audio Spatialization (TAS) framework that utilizes flexible text prompts and evaluates our model from unified generation and comprehension perspectives. Due to the limited availability of premium and large-scale stereo data, we construct the SpatialTAS dataset, which encompasses 376,000 simulated binaural audio samples to facilitate the training of our model. Our model learns binaural differences guided by 3D spatial location and relative position prompts, augmented by flipped-channel audio. It outperforms existing methods on both simulated and real-recorded datasets, demonstrating superior generalization and accuracy. Besides, we develop an assessment model based on Llama-3.1-8B, which evaluates the spatial semantic coherence between our generated binaural audio and text prompts through a spatial reasoning task. Results demonstrate that text prompts provide flexible and interactive control to generate binaural audio with excellent quality and semantic consistency in spatial locations. The dataset is available at https://github.com/Alice01010101/TASU
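Illustration: the abstract mentions training "augmented by flipped-channel audio". The following is a minimal sketch of what such an augmentation could look like, not the authors' implementation. It assumes a binaural clip is a pair of per-ear sample lists and that the prompt expresses position with the plain words "left" and "right"; the function names (`flip_channels`, `flip_prompt`, `augment`) are hypothetical.

```python
# Hypothetical sketch of flipped-channel augmentation; not the authors' code.
# Assumption: a binaural clip is a (left, right) pair of sample lists, and the
# text prompt describes position with the whole words "left" / "right".

_SWAP = {"left": "right", "right": "left"}


def flip_channels(left, right):
    """Swap the two ear signals, mirroring the scene about the median plane."""
    return list(right), list(left)


def flip_prompt(prompt):
    """Swap the words 'left' and 'right' so the text matches the flipped audio.

    Only whitespace-separated, lowercase words are handled; punctuation
    attached to a word (e.g. "left,") is left unchanged in this sketch.
    """
    return " ".join(_SWAP.get(word, word) for word in prompt.split())


def augment(left, right, prompt):
    """Return the mirrored (left, right, prompt) training example."""
    new_left, new_right = flip_channels(left, right)
    return new_left, new_right, flip_prompt(prompt)


if __name__ == "__main__":
    left = [0.9, 0.8, 0.7]   # louder ear: the source sits on the left
    right = [0.3, 0.2, 0.1]
    print(augment(left, right, "the dog barks on the left"))
```

Swapping the channels and the prompt wording together keeps each mirrored example consistent, so one simulated clip can yield two training pairs.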

Related papers

AudioScene: Integrating Object-Event Audio into 3D Scenes [19.66595321540055]
We present two novel audiospatial scene datasets, AudioScanNet and AudioRoboTHOR. By integrating audio clips with spatially aligned 3D scenes, our datasets enable research on how audio signals interact with spatial context.
arXiv Detail & Related papers (2025-11-25T14:28:13Z)
Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation [32.24603883810094]
Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models. We first construct a large-scale, simulation-based, and GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions. By leveraging spatial guidance, our model achieves the objective of generating immersive and controllable spatial audio from text.
arXiv Detail & Related papers (2024-10-14T16:18:29Z)
AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene. Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio. We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment. Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities. RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z)
BAT: Learning to Reason about Spatial Sounds with Large Language Models [45.757161909533714]
We present BAT, which combines the sound perception ability of a spatial scene analysis model with the natural language reasoning capabilities of a large language model (LLM). Our experiments demonstrate BAT's superior performance on both spatial sound perception and reasoning.
arXiv Detail & Related papers (2024-02-02T17:34:53Z)
AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis [61.07542274267568]
We study a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning. We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF. We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
arXiv Detail & Related papers (2023-02-04T04:17:19Z)
Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio. Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z)
Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of binaural recordings. We leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between spatial locations and received audios. Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.
arXiv Detail & Related papers (2021-04-13T13:07:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
