Modeling the Compatibility of Stem Tracks to Generate Music Mashups
- URL: http://arxiv.org/abs/2103.14208v1
- Date: Fri, 26 Mar 2021 01:51:11 GMT
- Title: Modeling the Compatibility of Stem Tracks to Generate Music Mashups
- Authors: Jiawen Huang, Ju-Chiang Wang, Jordan B. L. Smith, Xuchen Song, Yuxuan Wang
- Abstract summary: A music mashup combines audio elements from two or more songs to create a new work.
Researchers have developed algorithms that predict the compatibility of audio elements.
- Score: 6.922825755771942
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A music mashup combines audio elements from two or more songs to create a new
work. To reduce the time and effort required to make them, researchers have
developed algorithms that predict the compatibility of audio elements. Prior
work has focused on mixing unaltered excerpts, but advances in source
separation enable the creation of mashups from isolated stems (e.g., vocals,
drums, bass, etc.). In this work, we take advantage of separated stems not just
for creating mashups, but for training a model that predicts the mutual
compatibility of groups of excerpts, using self-supervised and semi-supervised
methods. Specifically, we first produce a random mashup creation pipeline that
combines stem tracks obtained via source separation, with key and tempo
automatically adjusted to match, since these are prerequisites for high-quality
mashups. To train a model to predict compatibility, we use stem tracks obtained
from the same song as positive examples, and random combinations of stems with
key and/or tempo unadjusted as negative examples. To improve the model and use
more data, we also train on "average" examples: random combinations with
matching key and tempo, which we treat as unlabeled data since their true
compatibility is unknown. To determine whether the combined signal or the set
of stem signals is more indicative of the quality of the result, we experiment
with two model architectures and train them using a semi-supervised learning
technique. Finally, we conduct objective and subjective evaluations of the
system, comparing it to a standard rule-based system.
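To make the training setup above concrete, the following is a minimal sketch of how positive, negative, and unlabeled "average" examples could be assembled from separated stems. It is an illustration under stated assumptions, not the paper's implementation: the stem names, the librosa-based key/tempo estimation, the pitch-shift/time-stretch adjustment, and all function names are hypothetical.

```python
# Minimal sketch (assumed, not from the paper) of the example-construction
# scheme described in the abstract: positive = stems from the same song,
# negative = cross-song stems with key/tempo left unadjusted,
# "average" = cross-song stems with key and tempo matched, label unknown.
import random

import librosa
import numpy as np

# Standard four-stem separation output; the abstract lists "vocals, drums, bass, etc."
STEM_NAMES = ["vocals", "drums", "bass", "other"]


def estimate_key_and_tempo(audio, sr):
    """Rough key/tempo estimates; the paper's exact estimators are not specified."""
    tempo, _ = librosa.beat.beat_track(y=audio, sr=sr)
    chroma = librosa.feature.chroma_cqt(y=audio, sr=sr)
    key = int(np.argmax(chroma.mean(axis=1)))  # crude proxy: dominant pitch class
    return key, float(tempo)


def match_key_and_tempo(audio, sr, semitone_shift, tempo_ratio):
    """Adjust one stem toward a reference key and tempo (mashup prerequisite)."""
    shifted = librosa.effects.pitch_shift(y=audio, sr=sr, n_steps=semitone_shift)
    return librosa.effects.time_stretch(y=shifted, rate=tempo_ratio)


def make_example(stems_by_song, sr, kind):
    """Return (stems, label): label is 1 (positive), 0 (negative), or None (unlabeled)."""
    if kind == "positive":
        # Stems separated from the same song are assumed mutually compatible.
        song = random.choice(list(stems_by_song))
        return [stems_by_song[song][name] for name in STEM_NAMES], 1

    # Otherwise draw each stem from a different, randomly chosen song.
    songs = random.sample(list(stems_by_song), k=len(STEM_NAMES))
    stems = [stems_by_song[s][name] for s, name in zip(songs, STEM_NAMES)]
    if kind == "negative":
        # Key and/or tempo left unadjusted: assumed to be a poor mashup.
        return stems, 0

    # "Average" example: match every stem to a common key and tempo,
    # but keep it unlabeled because its true compatibility is unknown.
    ref_key, ref_tempo = estimate_key_and_tempo(stems[0], sr)
    adjusted = [stems[0]]
    for stem in stems[1:]:
        key, tempo = estimate_key_and_tempo(stem, sr)
        adjusted.append(
            match_key_and_tempo(stem, sr, ref_key - key, ref_tempo / tempo)
        )
    return adjusted, None
```

A compatibility model would then be trained on the labeled examples, with the unlabeled "average" examples folded in via a semi-supervised technique; the two architectures the abstract contrasts (operating on the summed mixture versus the set of individual stem signals) sit on top of this data, and their details are not reproduced here.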
Related papers
- Enhancing the vocal range of single-speaker singing voice synthesis with
melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of single-speaker singing voice synthesis.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
- Benchmarks and leaderboards for sound demixing tasks [44.99833362998488]
We introduce two new benchmarks for sound source separation tasks.
We compare popular models for sound demixing, as well as their ensembles, on these benchmarks.
We also develop a novel approach for audio separation, based on the ensembling of different models that are suited best for the particular stem.
arXiv Detail & Related papers (2023-05-12T14:00:26Z)
- Anomalous Sound Detection using Audio Representation with Machine ID based
Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- Mix and Localize: Localizing Sound Sources in Mixtures [10.21507741240426]
We present a method for simultaneously localizing multiple sound sources within a visual scene.
Our method jointly solves both tasks at once, using a formulation inspired by the contrastive random walk of Jabri et al.
We show through experiments with musical instruments and human speech that our model can successfully localize multiple sounds.
arXiv Detail & Related papers (2022-11-28T04:30:50Z)
- Synthesizer Preset Interpolation using Transformer Auto-Encoders [4.213427823201119]
We introduce a bimodal auto-encoder neural network, which simultaneously processes presets using multi-head attention blocks, and audio using convolutions.
This model has been tested on a popular frequency modulation synthesizer with more than one hundred parameters.
After training, the proposed model can be integrated into commercial synthesizers for live or sound design tasks.
arXiv Detail & Related papers (2022-10-27T15:20:18Z)
- Improved singing voice separation with chromagram-based pitch-aware
remixing [26.299721372221736]
We propose chromagram-based pitch-aware remixing, where music segments with high pitch alignment are mixed.
We demonstrate that training models with pitch-aware remixing significantly improves the test signal-to-distortion ratio (SDR).
arXiv Detail & Related papers (2022-03-28T20:55:54Z)
- A Unified Model for Zero-shot Music Source Separation, Transcription and
Synthesis [13.263771543118994]
We propose a unified model for three inter-related tasks: 1) separating individual sound sources from mixed music audio, 2) transcribing each sound source to MIDI notes, and 3) synthesizing new pieces based on the timbre of the separated sources.
The model is inspired by the fact that when humans listen to music, the mind not only separates the sounds of different instruments but simultaneously perceives high-level representations such as score and timbre.
arXiv Detail & Related papers (2021-08-07T14:28:21Z)
- A framework to compare music generative models using automatic evaluation
metrics extended to rhythm [69.2737664640826]
This paper adopts the framework proposed in earlier research that did not consider rhythm, makes a series of design decisions, and then adds rhythm support to evaluate the performance of two RNN memory cells for generating monophonic music.
The model accounts for music transposition, and the framework evaluates the quality of the generated pieces using automatic, geometry-based quantitative metrics that have likewise been extended with rhythm support.
arXiv Detail & Related papers (2021-01-19T15:04:46Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model trained for automatic speech recognition with features extracted from the melody to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- RL-Duet: Online Music Accompaniment Generation Using Deep Reinforcement
Learning [69.20460466735852]
This paper presents a deep reinforcement learning algorithm for online accompaniment generation.
The proposed algorithm is able to respond to the human part and generate a melodic, harmonic and diverse machine part.
arXiv Detail & Related papers (2020-02-08T03:53:52Z)