Multi-Instrumentalist Net: Unsupervised Generation of Music from Body
Movements
- URL: http://arxiv.org/abs/2012.03478v1
- Date: Mon, 7 Dec 2020 06:54:10 GMT
- Title: Multi-Instrumentalist Net: Unsupervised Generation of Music from Body
Movements
- Authors: Kun Su, Xiulong Liu, Eli Shlizerman
- Abstract summary: We propose a novel system that takes as input the body movements of a musician playing a musical instrument and generates music in an unsupervised setting.
We build a pipeline named 'Multi-instrumentalistNet' that learns a discrete latent representation of various instruments' music from log-spectrograms.
We show that MIDI can further condition the latent space such that the pipeline will generate the exact content of the music being played by the instrument in the video.
- Score: 20.627164135805852
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel system that takes as input the body movements of a musician
playing a musical instrument and generates music in an unsupervised setting.
Learning to generate multi-instrumental music from videos without labeling the
instruments is a challenging problem. To achieve the transformation, we built a
pipeline named 'Multi-instrumentalistNet' (MI Net). At its base, the pipeline
learns a discrete latent representation of various instruments' music from
log-spectrograms using a Vector Quantized Variational Autoencoder (VQ-VAE) with
multi-band residual blocks. The pipeline is then trained along with an
autoregressive prior conditioned on the musician's body keypoint movements
encoded by a recurrent neural network. Joint training of the prior with the
body movements encoder disentangles the music into latent features that
indicate the musical components and the instrumental features. The
latent space results in distributions that are clustered into distinct
instruments from which new music can be generated. Furthermore, the VQ-VAE
architecture supports detailed music generation with additional conditioning.
We show that MIDI can further condition the latent space such that the
pipeline will generate the exact content of the music being played by the
instrument in the video. We evaluate MI Net on two datasets containing videos
of 13 instruments and obtain generated music of reasonable audio quality,
easily associated with the corresponding instrument, and consistent with the
music audio content.
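
To make the pipeline concrete, below is a minimal PyTorch sketch of the discrete latent step described in the abstract: a VQ-VAE-style quantizer over log-spectrogram frames. The layer sizes, the 512-entry codebook, the 64-dimensional codes, and the plain convolutional encoder/decoder are illustrative assumptions; the paper's actual architecture uses multi-band residual blocks, which are not reproduced here.

```python
# Illustrative sketch only: a VQ-VAE-style quantizer over log-spectrogram
# frames. All sizes (80 mel bands, 512 codes, 64-dim embeddings) and the
# plain convolutional encoder/decoder are assumptions; the paper uses
# multi-band residual blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z_e):
        # z_e: (batch, time, code_dim) continuous encoder outputs.
        flat = z_e.reshape(-1, z_e.shape[-1])
        # Squared L2 distance from each frame embedding to every codebook entry.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        indices = dist.argmin(dim=1)
        z_q = self.codebook(indices).view_as(z_e)
        # Codebook loss + commitment loss (standard VQ-VAE objective).
        vq_loss = (F.mse_loss(z_q, z_e.detach())
                   + self.beta * F.mse_loss(z_e, z_q.detach()))
        # Straight-through estimator so gradients reach the encoder.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices.view(z_e.shape[:-1]), vq_loss


class SpectrogramVQVAE(nn.Module):
    def __init__(self, n_mels=80, code_dim=64, num_codes=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(256, code_dim, kernel_size=3, padding=1),
        )
        self.quantizer = VectorQuantizer(num_codes, code_dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(code_dim, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=3, padding=1),
        )

    def forward(self, log_spec):
        # log_spec: (batch, n_mels, time); time assumed divisible by 2.
        z_e = self.encoder(log_spec).transpose(1, 2)   # (batch, time/2, code_dim)
        z_q, codes, vq_loss = self.quantizer(z_e)
        recon = self.decoder(z_q.transpose(1, 2))      # (batch, n_mels, time)
        return recon, codes, F.mse_loss(recon, log_spec) + vq_loss
```

Training minimizes the reconstruction and VQ losses; the resulting code indices are what the body-movement-conditioned prior (sketched next) models.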
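The abstract then describes an autoregressive prior over those code indices, conditioned on the musician's body keypoints encoded by a recurrent network. The sketch below assumes a GRU keypoint encoder, an LSTM prior, concatenation as the fusion mechanism, and one keypoint frame per code step; none of these specifics are stated in the abstract, so treat them as placeholders rather than the authors' design.

```python
# Sketch of an autoregressive prior over VQ code indices, conditioned on
# body-keypoint movements encoded by a recurrent network. All sizes
# (512 codes, 2D keypoints for 25 joints, hidden widths) and the
# concatenation-based fusion are assumptions made for illustration.
import torch
import torch.nn as nn


class KeypointConditionedPrior(nn.Module):
    def __init__(self, num_codes=512, keypoint_dim=2 * 25,
                 key_hidden=128, prior_hidden=256):
        super().__init__()
        self.keypoint_encoder = nn.GRU(keypoint_dim, key_hidden, batch_first=True)
        self.code_embedding = nn.Embedding(num_codes, 64)
        self.prior_rnn = nn.LSTM(64 + key_hidden, prior_hidden, batch_first=True)
        self.logits = nn.Linear(prior_hidden, num_codes)

    def forward(self, codes, keypoints):
        # codes:     (batch, T) discrete VQ indices from the spectrogram VQ-VAE
        # keypoints: (batch, T, keypoint_dim) body keypoints, one frame per code step
        motion, _ = self.keypoint_encoder(keypoints)        # (batch, T, key_hidden)
        # Teacher forcing: predict code t from codes < t plus motion features.
        prev = torch.cat([codes.new_zeros(codes.size(0), 1), codes[:, :-1]], dim=1)
        x = torch.cat([self.code_embedding(prev), motion], dim=-1)
        h, _ = self.prior_rnn(x)
        return self.logits(h)                               # (batch, T, num_codes)


if __name__ == "__main__":
    # Toy usage: cross-entropy between predicted and observed code indices.
    prior = KeypointConditionedPrior()
    codes = torch.randint(0, 512, (2, 100))
    keypoints = torch.randn(2, 100, 50)
    logits = prior(codes, keypoints)
    loss = nn.functional.cross_entropy(logits.reshape(-1, 512), codes.reshape(-1))
    print(loss.item())
```

The MIDI conditioning mentioned in the abstract could be incorporated in the same spirit by concatenating MIDI-derived features alongside the motion features, but that detail is omitted here.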
Related papers
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z) - Show Me the Instruments: Musical Instrument Retrieval from Mixture Audio [11.941510958668557]
We call this task Musical Instrument Retrieval.
We propose a method for retrieving desired musical instruments using reference music mixture as a query.
The proposed model consists of the Single-Instrument and Multi-Instrument encoders, both based on convolutional neural networks.
arXiv Detail & Related papers (2022-11-15T07:32:39Z) - Musika! Fast Infinite Waveform Music Generation [0.0]
We introduce Musika, a music generation system that can be trained on hundreds of hours of music using a single consumer GPU.
We achieve this by first learning a compact invertible representation of spectrogram magnitudes and phases with adversarial autoencoders.
A latent coordinate system enables generating arbitrarily long sequences of excerpts in parallel, while a global context vector allows the music to remain stylistically coherent through time.
arXiv Detail & Related papers (2022-08-18T08:31:15Z) - Symphony Generation with Permutation Invariant Language Model [57.75739773758614]
We present a symbolic symphony music generation solution, SymphonyNet, based on a permutation invariant language model.
A novel transformer decoder architecture is introduced as backbone for modeling extra-long sequences of symphony tokens.
Our empirical results show that our proposed approach can generate coherent, novel, complex, and harmonious symphonies when compared with human compositions.
arXiv Detail & Related papers (2022-05-10T13:08:49Z) - Quantized GAN for Complex Music Generation from Dance Videos [48.196705493763986]
We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates musical samples conditioned on dance videos.
Our proposed framework takes dance video frames and human body motion as input, and learns to generate music samples that plausibly accompany the corresponding input.
arXiv Detail & Related papers (2022-04-01T17:53:39Z) - MusIAC: An extensible generative framework for Music Infilling
Applications with multi-level Control [11.811562596386253]
Infilling refers to the task of generating musical sections given the surrounding multi-track music.
The proposed framework is extensible to new control tokens; the added control tokens include tonal tension per bar and track polyphony level.
We present the model in a Google Colab notebook to enable interactive generation.
arXiv Detail & Related papers (2022-02-11T10:02:21Z) - Towards Automatic Instrumentation by Learning to Separate Parts in
Symbolic Multitrack Music [33.679951600368405]
We study the feasibility of automatic instrumentation -- dynamically assigning instruments to notes in solo music during performance.
In addition to the online, real-time-capable setting for performative use cases, automatic instrumentation can also find applications in assistive composing tools in an offline setting.
We frame the task of part separation as a sequential multi-class classification problem and adopt machine learning to map sequences of notes into sequences of part labels; a minimal illustrative sketch of this framing appears after this list.
arXiv Detail & Related papers (2021-07-13T08:34:44Z) - Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z) - Foley Music: Learning to Generate Music from Videos [115.41099127291216]
Foley Music is a system that can synthesize plausible music for a silent video clip about people playing musical instruments.
We first identify two key intermediate representations for a successful video to music generator: body keypoints from videos and MIDI events from audio recordings.
We present a Graph-Transformer framework that can accurately predict MIDI event sequences in accordance with the body movements.
arXiv Detail & Related papers (2020-07-21T17:59:06Z) - Audeo: Audio Generation for a Silent Performance Video [17.705770346082023]
We present a novel system that takes as input video frames of a musician playing the piano and generates the music for that video.
Our main aim in this work is to explore the plausibility of such a transformation and to identify cues and components able to carry the association of sounds with visual events.
arXiv Detail & Related papers (2020-06-23T00:58:59Z) - Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
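
As referenced in the automatic-instrumentation entry above, part separation can be framed as sequential multi-class classification over notes. Below is a minimal sketch of that framing, assuming a BiLSTM tagger and a hypothetical three-number note feature (pitch, onset, duration); it illustrates the framing only and is not that paper's model.

```python
# Minimal sketch of "part separation as sequential multi-class classification":
# a BiLSTM maps a sequence of note features to a part/instrument label per note.
# Feature choice (pitch, onset, duration) and all sizes are assumptions.
import torch
import torch.nn as nn


class PartSeparator(nn.Module):
    def __init__(self, note_feature_dim=3, hidden=128, num_parts=5):
        super().__init__()
        self.rnn = nn.LSTM(note_feature_dim, hidden, batch_first=True,
                           bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_parts)

    def forward(self, notes):
        # notes: (batch, num_notes, note_feature_dim), e.g. pitch/onset/duration
        h, _ = self.rnn(notes)
        return self.classifier(h)   # (batch, num_notes, num_parts) part logits


if __name__ == "__main__":
    model = PartSeparator()
    notes = torch.randn(1, 32, 3)    # 32 notes with 3 features each
    part_logits = model(notes)
    print(part_logits.argmax(-1))    # predicted part label per note
```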