Music source separation conditioned on 3D point clouds
- URL: http://arxiv.org/abs/2102.02028v1
- Date: Wed, 3 Feb 2021 12:18:35 GMT
- Title: Music source separation conditioned on 3D point clouds
- Authors: Francesc Lluís, Vasileios Chatziioannou, Alex Hofmann
- Abstract summary: This paper proposes a multi-modal deep learning model to perform music source separation conditioned on 3D point clouds of music performance recordings.
It extracts visual features using 3D sparse convolutions, while audio features are extracted using dense convolutions.
A fusion module combines the extracted features to finally perform the audio source separation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, significant progress has been made in audio source separation by
the application of deep learning techniques. Current methods that combine both
audio and visual information use 2D representations such as images to guide the
separation process. However, in order to re-create acoustically correct
scenes for 3D virtual/augmented reality applications from recordings of real
music ensembles, detailed information about each sound source in the 3D
environment is required. This demand, together with the proliferation of 3D
visual acquisition systems like LiDAR or RGB-depth cameras, stimulates the
creation of models that can guide the audio separation using 3D visual
information. This paper proposes a multi-modal deep learning model to perform
music source separation conditioned on 3D point clouds of music performance
recordings. This model extracts visual features using 3D sparse convolutions,
while audio features are extracted using dense convolutions. A fusion module
combines the extracted features to finally perform the audio source separation.
It is shown that the presented model can distinguish the musical instruments
from a single 3D point cloud frame, and perform source separation qualitatively
similar to a reference case, where manually assigned instrument labels are
provided.
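The architecture described above can be summarized as three stages: a visual branch that encodes the 3D point cloud, an audio branch that encodes the mixture spectrogram, and a fusion module that conditions a separation mask on the visual embedding. Below is a minimal numpy sketch of this idea, not the paper's implementation: the sparse 3D convolutions are replaced by a hypothetical per-point projection with permutation-invariant max pooling, the dense convolutions by a plain linear map, and the fusion by FiLM-style gating; all weights and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def visual_encoder(points, W):
    """Stand-in for the paper's 3D sparse convolutions.

    points: (N, 3) point cloud coordinates.
    Returns a fixed-size, permutation-invariant embedding of shape (D,).
    """
    h = np.maximum(points @ W, 0.0)   # per-point ReLU features, (N, D)
    return h.max(axis=0)              # max pooling over points -> (D,)

def audio_encoder(spec, W):
    """Stand-in for the dense convolutions on the mixture spectrogram.

    spec: (F, T) magnitude spectrogram. Returns features of shape (D, T).
    """
    return np.maximum(W @ spec, 0.0)

def fusion_separate(audio_feat, vis_emb, Wg, Wm, spec):
    """Fusion module: the visual embedding gates the audio features,
    then a mask head predicts a soft spectrogram mask for the target source."""
    gate = 1.0 / (1.0 + np.exp(-(Wg @ vis_emb)))   # (D,) conditioning gate
    fused = audio_feat * gate[:, None]             # (D, T) conditioned features
    mask = 1.0 / (1.0 + np.exp(-(Wm @ fused)))     # (F, T), values in (0, 1)
    return mask * spec                             # estimated source spectrogram

# Toy dimensions and random weights (illustrative only)
D, F, T, N = 16, 32, 10, 100
Wv = rng.normal(size=(3, D))
Wa = rng.normal(size=(D, F))
Wg = rng.normal(size=(D, D))
Wm = rng.normal(size=(F, D))

points = rng.normal(size=(N, 3))            # one 3D point cloud frame
mixture = np.abs(rng.normal(size=(F, T)))   # mixture magnitude spectrogram

vis = visual_encoder(points, Wv)
aud = audio_encoder(mixture, Wa)
est = fusion_separate(aud, vis, Wg, Wm, mixture)

assert est.shape == mixture.shape
assert np.all(est <= mixture + 1e-9)  # a soft mask can only attenuate
```

Because the visual embedding is pooled over all points, a single point-cloud frame suffices to condition the mask, which mirrors the claim above that one frame is enough to distinguish the instruments.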
Related papers
- Memory-based Adapters for Online 3D Scene Perception [71.71645534899905]
Conventional 3D scene perception methods are offline, i.e., take an already reconstructed 3D scene geometry as input.
We propose an adapter-based plug-and-play module for the backbone of 3D scene perception model.
Our adapters can be easily inserted into mainstream offline architectures of different tasks and significantly boost their performance on online tasks.
arXiv Detail & Related papers (2024-03-11T17:57:41Z) - Regulating Intermediate 3D Features for Vision-Centric Autonomous Driving [26.03800936700545]
We propose to regulate intermediate dense 3D features with the help of volume rendering.
Experimental results on the Occ3D and nuScenes datasets demonstrate that Vampire facilitates fine-grained and appropriate extraction of dense 3D features.
arXiv Detail & Related papers (2023-12-19T04:09:05Z) - A Unified Framework for 3D Point Cloud Visual Grounding [60.75319271082741]
This paper takes the first step toward integrating 3DREC and 3DRES into a unified framework, termed 3DRefTR.
Its key idea is to build upon a mature 3DREC model and leverage ready query embeddings and visual tokens from the 3DREC model to construct a dedicated mask branch.
This design enables 3DRefTR to achieve both well-performing 3DRES and 3DREC capabilities with only 6% additional latency compared to the original 3DREC model.
arXiv Detail & Related papers (2023-08-23T03:20:31Z) - AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z) - ARTIC3D: Learning Robust Articulated 3D Shapes from Noisy Web Image Collections [71.46546520120162]
Estimating 3D articulated shapes like animal bodies from monocular images is inherently challenging.
We propose ARTIC3D, a self-supervised framework to reconstruct per-instance 3D shapes from a sparse image collection in-the-wild.
We produce realistic animations by fine-tuning the rendered shape and texture under rigid part transformations.
arXiv Detail & Related papers (2023-06-07T17:47:50Z) - A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition [26.828874753756523]
We propose a unified audio-visual learning framework (dubbed OneAVM) that integrates audio and visual cues for joint localization, separation, and recognition.
OneAVM comprises a shared audio-visual encoder and task-specific decoders trained with three objectives.
Experiments on MUSIC, VGG-Instruments, VGG-Music, and VGGSound datasets demonstrate the effectiveness of OneAVM for all three tasks.
arXiv Detail & Related papers (2023-05-30T23:53:12Z) - 3inGAN: Learning a 3D Generative Model from Images of a Self-similar Scene [34.2144933185175]
3inGAN is an unconditional 3D generative model trained from 2D images of a single self-similar 3D scene.
We show results on semi-stochastic scenes of varying scale and complexity, obtained from real and synthetic sources.
arXiv Detail & Related papers (2022-11-27T18:03:21Z) - RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation [68.06991943974195]
We present RenderDiffusion, the first diffusion model for 3D generation and inference, trained using only monocular 2D supervision.
We evaluate RenderDiffusion on FFHQ, AFHQ, ShapeNet and CLEVR datasets, showing competitive performance for generation of 3D scenes and inference of 3D scenes from 2D images.
arXiv Detail & Related papers (2022-11-17T20:17:04Z) - Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source
Separation [36.38300120482868]
We present Audio Separator and Motion Predictor (ASMP) -- a deep learning framework that leverages the 3D structure of the scene and the motion of sound sources for better audio source separation.
ASMP achieves a clear improvement in source separation quality, outperforming prior works on two challenging audio-visual datasets.
arXiv Detail & Related papers (2022-10-29T02:55:39Z) - DSGN++: Exploiting Visual-Spatial Relation forStereo-based 3D Detectors [60.88824519770208]
Camera-based 3D object detectors are attractive due to their wider deployability and lower cost compared to LiDAR sensors.
We revisit the prior stereo model DSGN and its stereo volume construction for representing both 3D geometry and semantics.
We propose our approach, DSGN++, aiming for improving information flow throughout the 2D-to-3D pipeline.
arXiv Detail & Related papers (2022-04-06T18:43:54Z) - Points2Sound: From mono to binaural audio using 3D point cloud scenes [0.0]
We propose Points2Sound, a multi-modal deep learning model which generates a binaural version from mono audio using 3D point cloud scenes.
Results show that 3D visual information can successfully guide multi-modal deep learning models for the task of binaural audio synthesis.
arXiv Detail & Related papers (2021-04-26T10:44:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.