Related papers: SoundSculpt: Direction and Semantics Driven Ambisonic Target Sound Extraction

URL: http://arxiv.org/abs/2506.00273v1
Date: Fri, 30 May 2025 22:15:10 GMT
Title: SoundSculpt: Direction and Semantics Driven Ambisonic Target Sound Extraction
Authors: Tuochao Chen, D Shin, Hakan Erdogan, Sinan Hersek,
Abstract summary: SoundSculpt is a neural network designed to extract target sound fields from ambisonic recordings. SoundSculpt employs an ambisonic-in-ambisonic-out architecture and is conditioned on both spatial information and semantic embeddings.
Score: 5.989764659998189
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper introduces SoundSculpt, a neural network designed to extract target sound fields from ambisonic recordings. SoundSculpt employs an ambisonic-in-ambisonic-out architecture and is conditioned on both spatial information (e.g., target direction obtained by pointing at an immersive video) and semantic embeddings (e.g., derived from image segmentation and captioning). Trained and evaluated on synthetic and real ambisonic mixtures, SoundSculpt demonstrates superior performance compared to various signal processing baselines. Our results further reveal that while spatial conditioning alone can be effective, the combination of spatial and semantic information is beneficial in scenarios where there are secondary sound sources spatially close to the target. Additionally, we compare two different semantic embeddings derived from a text description of the target sound using text encoders.
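To make the idea of direction-conditioned extraction from ambisonics concrete, here is a minimal sketch of a fixed first-order beamformer steered at a target direction. This is not SoundSculpt's method, and the abstract does not say which signal processing baselines were used; it only illustrates the kind of classical approach such a baseline could take. The function names, the ACN/SN3D channel convention, and the test directions are assumptions chosen for illustration. Unlike the ambisonic-in-ambisonic-out network described above, this sketch returns a mono signal.

```python
import numpy as np


def steering_vector(azimuth, elevation):
    """First-order ambisonic gains for a direction (assumed ACN order W, Y, Z, X; SN3D)."""
    return np.array([
        1.0,                                  # W: omnidirectional
        np.sin(azimuth) * np.cos(elevation),  # Y: left-right
        np.sin(elevation),                    # Z: up-down
        np.cos(azimuth) * np.cos(elevation),  # X: front-back
    ])


def encode_foa(signal, azimuth, elevation):
    """Encode a mono signal as a 4-channel first-order ambisonic plane wave."""
    return steering_vector(azimuth, elevation)[:, None] * signal[None, :]


def cardioid_beam(foa, azimuth, elevation):
    """Point a virtual cardioid microphone at the given direction (mono output)."""
    # Gain is 0.5 * (1 + cos(angle between beam and source)):
    # 1 on-axis, 0 for a source in the exactly opposite direction.
    return 0.5 * (steering_vector(azimuth, elevation) @ foa)


# Hypothetical scene: a target and an interferer in opposite directions (radians).
rng = np.random.default_rng(0)
target = rng.standard_normal(1000)
interferer = rng.standard_normal(1000)
mixture = encode_foa(target, 0.5, 0.1) + encode_foa(interferer, 0.5 + np.pi, -0.1)

extracted = cardioid_beam(mixture, 0.5, 0.1)
```

A fixed beam like this only attenuates sources by angle, so an interferer close to the target direction passes almost unchanged. That is the scenario in which the abstract reports that adding semantic conditioning helps.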

Related papers

Sci-Phi: A Large Language Model Spatial Audio Descriptor [25.302416479626974]
Sci-Phi is a spatial audio model with dual spatial and spectral encoders. It enumerates and describes up to four directional sound sources in one pass. It generalizes to real room impulse responses with only minor performance degradation.
arXiv Detail & Related papers (2025-10-07T03:06:02Z)
SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation [50.03810359300705]
SpA2V decomposes the generation process into two stages: audio-guided video planning and layout-grounded video generation. We show that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audios.
arXiv Detail & Related papers (2025-08-01T17:05:04Z)
Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping [7.291750095728984]
We present Sat2Sound, a framework to predict the distribution of sounds at any location on Earth. Our approach incorporates contrastive learning across audio, audio captions, satellite images, and satellite image captions. We introduce a novel application: location-based soundscape synthesis, which enables immersive acoustic experiences.
arXiv Detail & Related papers (2025-05-19T23:36:04Z)
ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model [2.2927722373373247]
We introduce ImmerseDiffusion, an end-to-end generative audio model that produces 3D immersive soundscapes conditioned on the spatial, temporal, and environmental conditions of sound objects.
arXiv Detail & Related papers (2024-10-19T02:28:53Z)
AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene. Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio. We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment. Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
BAT: Learning to Reason about Spatial Sounds with Large Language Models [45.757161909533714]
We present BAT, which combines the sound perception ability of a spatial scene analysis model with the natural language reasoning capabilities of a large language model (LLM). Our experiments demonstrate BAT's superior performance on both spatial sound perception and reasoning.
arXiv Detail & Related papers (2024-02-02T17:34:53Z)
Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization [13.278494654137138]
Humans utilize both audio and visual modalities as spatial cues to locate sound sources. We propose an audio-visual spatial integration network that integrates spatial cues from both modalities. Our method can perform more robust sound source localization.
arXiv Detail & Related papers (2023-08-11T11:57:58Z)
Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio. Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations. We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video. Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z)
Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization. This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions. A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features. We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution. We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
arXiv Detail & Related papers (2020-07-13T12:35:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.