Music source separation conditioned on 3D point clouds
- URL: http://arxiv.org/abs/2102.02028v1
- Date: Wed, 3 Feb 2021 12:18:35 GMT
- Title: Music source separation conditioned on 3D point clouds
- Authors: Francesc Lluís, Vasileios Chatziioannou, Alex Hofmann
- Abstract summary: This paper proposes a multi-modal deep learning model to perform music source separation conditioned on 3D point clouds of music performance recordings.
It extracts visual features using 3D sparse convolutions, while audio features are extracted using dense convolutions.
A fusion module combines the extracted features to finally perform the audio source separation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, significant progress has been made in audio source separation by
the application of deep learning techniques. Current methods that combine both
audio and visual information use 2D representations such as images to guide the
separation process. However, in order to (re)create acoustically correct
scenes for 3D virtual/augmented reality applications from recordings of real
music ensembles, detailed information about each sound source in the 3D
environment is required. This demand, together with the proliferation of 3D
visual acquisition systems like LiDAR or RGB-depth cameras, stimulates the
creation of models that can guide the audio separation using 3D visual
information. This paper proposes a multi-modal deep learning model to perform
music source separation conditioned on 3D point clouds of music performance
recordings. This model extracts visual features using 3D sparse convolutions,
while audio features are extracted using dense convolutions. A fusion module
combines the extracted features to finally perform the audio source separation.
It is shown that the presented model can distinguish the musical instruments
from a single 3D point cloud frame, and perform source separation qualitatively
similar to a reference case, where manually assigned instrument labels are
provided.
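To make the pipeline described in the abstract more concrete, below is a minimal PyTorch sketch of the conditioning idea: a visual branch encodes a voxelized 3D point-cloud frame, an audio branch encodes the mixture spectrogram with dense convolutions, and a fusion module uses the visual embedding to condition a separation mask. This is an illustrative sketch only, not the authors' implementation: the dense Conv3d layers stand in for the paper's 3D sparse convolutions (which would typically use a sparse-tensor library), and the FiLM-style fusion, layer sizes, and input shapes are assumptions.

```python
# Minimal sketch of point-cloud-conditioned source separation (assumptions noted above).
import torch
import torch.nn as nn


class VisualEncoder(nn.Module):
    """Encodes a voxelized point-cloud frame into a global embedding.
    Dense Conv3d is used here as a stand-in for 3D sparse convolutions."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # global pooling over the voxel grid
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, voxels):              # voxels: (B, 1, D, H, W)
        h = self.net(voxels).flatten(1)     # (B, 64)
        return self.proj(h)                 # (B, embed_dim)


class AudioEncoder(nn.Module):
    """Dense 2D convolutions over the mixture magnitude spectrogram."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, spec):                # spec: (B, 1, F, T)
        return self.net(spec)               # (B, channels, F, T)


class FusionSeparator(nn.Module):
    """FiLM-style fusion (an assumption): the visual embedding scales and
    shifts the audio features before a 1x1 conv predicts a separation mask."""
    def __init__(self, channels=64, embed_dim=128):
        super().__init__()
        self.film = nn.Linear(embed_dim, 2 * channels)
        self.mask_head = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, audio_feat, vis_emb):
        gamma, beta = self.film(vis_emb).chunk(2, dim=1)      # (B, C) each
        gamma = gamma[:, :, None, None]
        beta = beta[:, :, None, None]
        conditioned = gamma * audio_feat + beta
        return torch.sigmoid(self.mask_head(conditioned))     # mask in [0, 1]


if __name__ == "__main__":
    voxels = torch.rand(2, 1, 32, 32, 32)     # toy voxelized point-cloud frames
    mixture = torch.rand(2, 1, 256, 128)      # toy magnitude spectrograms
    vis = VisualEncoder()(voxels)
    aud = AudioEncoder()(mixture)
    mask = FusionSeparator()(aud, vis)
    separated = mask * mixture                # masked spectrogram of the target source
    print(separated.shape)                    # torch.Size([2, 1, 256, 128])
```

In a full system the separated waveform would be recovered by applying the predicted mask to the mixture spectrogram and inverting the STFT; the toy shapes above only illustrate the data flow, not the actual network capacity or training setup.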
Related papers
- 3D Audio-Visual Segmentation [44.61476023587931]
Recognizing the sounding objects in scenes is a longstanding objective in embodied AI, with diverse applications in robotics and AR/VR/MR.
We propose a new approach, EchoSegnet, which integrates ready-to-use knowledge from pretrained 2D audio-visual foundation models.
Experiments demonstrate that EchoSegnet can effectively segment sounding objects in 3D space on our new benchmark, representing a significant advancement in the field of embodied AI.
arXiv Detail & Related papers (2024-11-04T16:30:14Z) - Memory-based Adapters for Online 3D Scene Perception [71.71645534899905]
Conventional 3D scene perception methods are offline, i.e., they take an already reconstructed 3D scene geometry as input.
We propose an adapter-based plug-and-play module for the backbone of 3D scene perception models.
Our adapters can be easily inserted into mainstream offline architectures of different tasks and significantly boost their performance on online tasks.
arXiv Detail & Related papers (2024-03-11T17:57:41Z) - ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models [65.22994156658918]
We present a method that learns to generate multi-view images in a single denoising process from real-world data.
We design an autoregressive generation scheme that renders more 3D-consistent images at any viewpoint.
arXiv Detail & Related papers (2024-03-04T07:57:05Z) - Regulating Intermediate 3D Features for Vision-Centric Autonomous Driving [26.03800936700545]
We propose to regulate intermediate dense 3D features with the help of volume rendering.
Experimental results on the Occ3D and nuScenes datasets demonstrate that Vampire facilitates fine-grained and appropriate extraction of dense 3D features.
arXiv Detail & Related papers (2023-12-19T04:09:05Z) - AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z) - ARTIC3D: Learning Robust Articulated 3D Shapes from Noisy Web Image Collections [71.46546520120162]
Estimating 3D articulated shapes like animal bodies from monocular images is inherently challenging.
We propose ARTIC3D, a self-supervised framework to reconstruct per-instance 3D shapes from a sparse image collection in-the-wild.
We produce realistic animations by fine-tuning the rendered shape and texture under rigid part transformations.
arXiv Detail & Related papers (2023-06-07T17:47:50Z) - 3inGAN: Learning a 3D Generative Model from Images of a Self-similar Scene [34.2144933185175]
3inGAN is an unconditional 3D generative model trained from 2D images of a single self-similar 3D scene.
We show results on semi-stochastic scenes of varying scale and complexity, obtained from real and synthetic sources.
arXiv Detail & Related papers (2022-11-27T18:03:21Z) - RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation [68.06991943974195]
We present RenderDiffusion, the first diffusion model for 3D generation and inference, trained using only monocular 2D supervision.
We evaluate RenderDiffusion on FFHQ, AFHQ, ShapeNet and CLEVR datasets, showing competitive performance for generation of 3D scenes and inference of 3D scenes from 2D images.
arXiv Detail & Related papers (2022-11-17T20:17:04Z) - Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation [36.38300120482868]
We present Audio Separator and Motion Predictor (ASMP) -- a deep learning framework that leverages the 3D structure of the scene and the motion of sound sources for better audio source separation.
ASMP achieves a clear improvement in source separation quality, outperforming prior works on two challenging audio-visual datasets.
arXiv Detail & Related papers (2022-10-29T02:55:39Z) - DSGN++: Exploiting Visual-Spatial Relation for Stereo-based 3D Detectors [60.88824519770208]
Camera-based 3D object detectors are attractive owing to their broader deployability and lower cost compared with LiDAR sensors.
We revisit the prior stereo model DSGN and its stereo volume construction for representing both 3D geometry and semantics.
We propose DSGN++, which aims to improve information flow throughout the 2D-to-3D pipeline.
arXiv Detail & Related papers (2022-04-06T18:43:54Z) - Points2Sound: From mono to binaural audio using 3D point cloud scenes [0.0]
We propose Points2Sound, a multi-modal deep learning model which generates a binaural version from mono audio using 3D point cloud scenes.
Results show that 3D visual information can successfully guide multi-modal deep learning models for the task of binaural audio synthesis.
arXiv Detail & Related papers (2021-04-26T10:44:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.