Audio Latent Space Cartography
- URL: http://arxiv.org/abs/2212.02610v2
- Date: Wed, 7 Dec 2022 09:46:01 GMT
- Title: Audio Latent Space Cartography
- Authors: Nicolas Jonason, Bob L.T. Sturm
- Abstract summary: We explore the generation of visualisations of audio latent spaces using an audio-to-image generation pipeline.
We believe this can help with the interpretability of audio latent spaces.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore the generation of visualisations of audio latent spaces using an
audio-to-image generation pipeline. We believe this can help with the
interpretability of audio latent spaces. We demonstrate a variety of results on
the NSynth dataset. A web demo is available.
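The abstract does not spell out the pipeline, so the following is only a minimal sketch of the general idea of latent-space cartography: sample a grid of points in a (here, two-dimensional) latent space, turn each point into an image, and tile the images into a map. `latent_to_image` is a hypothetical stand-in for an audio-to-image generator, not the authors' model.

```python
import numpy as np

IMG_SIZE = 32    # side length of each generated tile (placeholder)

def latent_to_image(z: np.ndarray) -> np.ndarray:
    """Hypothetical audio-to-image generator: maps a latent vector to an RGB tile.
    A real pipeline would use a trained generative model; here a deterministic
    pattern stands in so the sketch runs end to end."""
    ys, xs = np.mgrid[0:IMG_SIZE, 0:IMG_SIZE] / IMG_SIZE
    r = 0.5 + 0.5 * np.sin(2 * np.pi * xs * z[0])
    g = 0.5 + 0.5 * np.sin(2 * np.pi * ys * z[1])
    b = 0.5 + 0.5 * np.cos(2 * np.pi * (xs + ys) * z.mean())
    return np.stack([r, g, b], axis=-1)

def latent_space_map(grid_size: int = 8) -> np.ndarray:
    """Tile generated images over a regular grid of latent points ('cartography')."""
    coords = np.linspace(-1.0, 1.0, grid_size)
    rows = []
    for zy in coords:
        row = [latent_to_image(np.array([zx, zy])) for zx in coords]
        rows.append(np.concatenate(row, axis=1))
    return np.concatenate(rows, axis=0)

atlas = latent_space_map()
print("latent-space map shape:", atlas.shape)   # (256, 256, 3)
```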
Related papers
- ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model [2.2927722373373247]
We introduce ImmerseDiffusion, an end-to-end generative audio model that produces 3D immersive soundscapes conditioned on the spatial, temporal, and environmental conditions of sound objects.
arXiv Detail & Related papers (2024-10-19T02:28:53Z)
- Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation [32.24603883810094]
Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models.
We first construct a large-scale, simulation-based, and GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions even including moving and multiple sources.
By leveraging spatial guidance, our unified model achieves the objective of generating immersive and controllable spatial audio from text and image.
arXiv Detail & Related papers (2024-10-14T16:18:29Z)
- PSM: Learning Probabilistic Embeddings for Multi-scale Zero-Shot Soundscape Mapping [7.076417856575795]
A soundscape is defined by the acoustic environment a person perceives at a location.
We propose a framework for mapping soundscapes across the Earth.
We represent locations with multi-scale satellite imagery and learn a joint representation among this imagery, audio, and text.
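The abstract only states that a joint representation is learned across satellite imagery, audio, and text, without naming the training objective. Below is a minimal sketch assuming a CLIP-style symmetric contrastive loss between paired modality embeddings; the encoders, dimensions, and loss choice are placeholders, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in encoder: flattens its input and projects it into a shared embedding space."""
    def __init__(self, in_dim: int, embed_dim: int = 64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.proj(x.flatten(1))
        return F.normalize(z, dim=-1)          # unit-norm embeddings

def contrastive_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of paired embeddings (assumed objective)."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Hypothetical feature shapes for satellite imagery, audio, and text.
img_enc, aud_enc, txt_enc = TinyEncoder(3 * 32 * 32), TinyEncoder(1 * 64 * 64), TinyEncoder(300)
imgs  = torch.randn(8, 3, 32, 32)    # satellite-image patches (placeholder)
audio = torch.randn(8, 1, 64, 64)    # audio spectrograms (placeholder)
text  = torch.randn(8, 300)          # text features (placeholder)

zi, za, zt = img_enc(imgs), aud_enc(audio), txt_enc(text)
loss = contrastive_loss(zi, za) + contrastive_loss(zi, zt)
print("joint-embedding loss:", loss.item())
```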
arXiv Detail & Related papers (2024-08-13T17:37:40Z)
- AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis [61.07542274267568]
We study a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning.
We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF.
We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
arXiv Detail & Related papers (2023-02-04T04:17:19Z)
- Novel-View Acoustic Synthesis [140.1107768313269]
We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint?
We propose a neural rendering approach: Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize the sound of an arbitrary point in space.
arXiv Detail & Related papers (2023-01-20T18:49:58Z)
- Sound-Guided Semantic Video Generation [15.225598817462478]
We propose a framework to generate realistic videos by leveraging multimodal (sound-image-text) embedding space.
As sound provides the temporal contexts of the scene, our framework learns to generate a video that is semantically consistent with sound.
arXiv Detail & Related papers (2022-04-20T07:33:10Z)
- Learning Neural Acoustic Fields [110.22937202449025]
We introduce Neural Acoustic Fields (NAFs), an implicit representation that captures how sounds propagate in a physical scene.
By modeling acoustic propagation in a scene as a linear time-invariant system, NAFs learn to continuously map all emitter and listener location pairs.
We demonstrate that the continuous nature of NAFs enables us to render spatial acoustics for a listener at an arbitrary location, and can predict sound propagation at novel locations.
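Only the core formulation is given above (an implicit field over emitter/listener pairs plus a linear time-invariant assumption), so the following is a rough sketch of that idea: a small MLP maps an emitter and listener position to an impulse response, and the received audio is the source signal convolved with it. The network, impulse-response length, and coordinates are placeholders, not the NAF implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

IR_LEN = 256   # impulse-response length in samples (placeholder)

class ToyAcousticField(nn.Module):
    """Stand-in implicit field: (emitter xyz, listener xyz) -> impulse response."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, IR_LEN),
        )

    def forward(self, emitter: torch.Tensor, listener: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([emitter, listener], dim=-1))

def render(source: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
    """LTI assumption: received audio = source signal convolved with the impulse response."""
    kernel = ir.flip(-1).view(1, 1, -1)             # flip so conv1d performs convolution
    padded = F.pad(source.view(1, 1, -1), (IR_LEN - 1, 0))
    return F.conv1d(padded, kernel).view(-1)

field = ToyAcousticField()
emitter = torch.tensor([[0.0, 0.0, 1.5]])     # emitter position (placeholder)
listener = torch.tensor([[2.0, 1.0, 1.5]])    # a novel listener position (placeholder)
ir = field(emitter, listener)[0]
received = render(torch.randn(16000), ir)     # 1 s of audio at 16 kHz (placeholder)
print(received.shape)                         # torch.Size([16000])
```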
arXiv Detail & Related papers (2022-04-04T17:59:37Z)
- Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z)
- Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of recordings.
We leverage spherical harmonic decomposition and head-related impulse responses (HRIR) to identify the relationship between spatial locations and received audio.
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.
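As a loose illustration of the HRIR half of this idea (the spherical-harmonic decomposition is omitted), the sketch below convolves a mono signal with direction-dependent left/right head-related impulse responses to obtain pseudo binaural audio. The HRIRs are synthetic placeholders with crude level and delay cues; a real pipeline would use measured HRIRs.

```python
import numpy as np

SR = 16000          # sample rate (placeholder)
HRIR_LEN = 128      # HRIR length in samples (placeholder)

def lookup_hrir(azimuth_deg: float) -> tuple[np.ndarray, np.ndarray]:
    """Placeholder HRIR lookup. A real pipeline would interpolate measured
    left/right HRIRs for this direction; here we fake level and delay cues."""
    rng = np.random.default_rng(0)
    base = rng.standard_normal(HRIR_LEN) * np.exp(-np.arange(HRIR_LEN) / 20.0)
    delay = int(abs(azimuth_deg) / 90.0 * 8)                 # crude interaural time difference
    gain_near, gain_far = 1.0, 10 ** (-abs(azimuth_deg) / 90.0 * 6 / 20)  # ~6 dB level difference
    near = base * gain_near
    far = np.concatenate([np.zeros(delay), base[:HRIR_LEN - delay]]) * gain_far
    return (near, far) if azimuth_deg < 0 else (far, near)   # (left, right)

def mono_to_binaural(mono: np.ndarray, azimuth_deg: float) -> np.ndarray:
    """Convolve a mono signal with direction-dependent left/right HRIRs."""
    hrir_l, hrir_r = lookup_hrir(azimuth_deg)
    left = np.convolve(mono, hrir_l)[: len(mono)]
    right = np.convolve(mono, hrir_r)[: len(mono)]
    return np.stack([left, right])                           # shape (2, n_samples)

mono = np.random.randn(SR)                                   # 1 s of placeholder audio
binaural = mono_to_binaural(mono, azimuth_deg=45.0)          # source 45 degrees to the right
print(binaural.shape)                                        # (2, 16000)
```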
arXiv Detail & Related papers (2021-04-13T13:07:33Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
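A toy illustration of region-specific stream weights (how the weights are estimated and learned is not shown): audio-based and video-based localization maps over a spatial grid are fused with weights that differ per region, and the fused map gives the speaker position estimate. All maps and weights below are placeholders.

```python
import numpy as np

GRID = 16   # discretize the room into a GRID x GRID map of candidate positions

def normalize(p: np.ndarray) -> np.ndarray:
    return p / p.sum()

rng = np.random.default_rng(1)
p_audio = normalize(rng.random((GRID, GRID)))   # audio-based localization map (placeholder)
p_video = normalize(rng.random((GRID, GRID)))   # video-based localization map (placeholder)

# Region-specific stream weights: here, trust video more in the half of the room
# covered by the camera and audio more elsewhere (purely illustrative choice).
w_video = np.where(np.arange(GRID)[None, :] < GRID // 2, 0.8, 0.3)
w_audio = 1.0 - w_video

# Log-linear fusion of the two streams with per-region weights.
fused = normalize(p_audio ** w_audio * p_video ** w_video)
speaker_pos = np.unravel_index(fused.argmax(), fused.shape)
print("estimated speaker cell:", speaker_pos)
```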
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.