Latent Space Explorations of Singing Voice Synthesis using DDSP
- URL: http://arxiv.org/abs/2103.07197v1
- Date: Fri, 12 Mar 2021 10:38:29 GMT
- Title: Latent Space Explorations of Singing Voice Synthesis using DDSP
- Authors: Juan Alonso and Cumhur Erkut
- Abstract summary: Machine learning based singing voice models require large datasets and lengthy training times.
We present a lightweight architecture that is able to output song-like utterances conditioned only on pitch and amplitude.
We present two zero-configuration tools to train new models and experiment with them.
- Score: 2.7920304852537527
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Machine learning based singing voice models require large datasets and
lengthy training times. In this work we present a lightweight architecture,
based on the Differentiable Digital Signal Processing (DDSP) library, that is
able to output song-like utterances conditioned only on pitch and amplitude,
after twelve hours of training using small datasets of unprocessed audio. The
results are promising, as both the melody and the singer's voice are
recognizable. In addition, we present two zero-configuration tools to train new
models and experiment with them. Currently we are exploring the latent space
representation, which is included in the DDSP library, but not in the original
DDSP examples. Our results indicate that the latent space improves both the
identification of the singer as well as the comprehension of the lyrics. Our
code is available at https://github.com/juanalonso/DDSP-singing-experiments
with links to the zero-configuration notebooks, and our sound examples are at
https://juanalonso.github.io/DDSP-singing-experiments/ .
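The abstract's core idea, synthesizing audio conditioned only on pitch and amplitude, rests on DDSP's differentiable harmonic synthesizer. The sketch below is a hypothetical, non-learned illustration of that control scheme: `harmonic_synth` is not a function from the DDSP library, and in the real system a neural network predicts per-harmonic amplitudes from these controls rather than using a fixed 1/k rolloff.

```python
import numpy as np

def harmonic_synth(f0, amplitude, n_harmonics=20, sample_rate=16000):
    """Render audio from frame-level pitch (Hz) and amplitude controls.

    A crude additive sketch of the harmonic synthesizer at the core of
    DDSP: each harmonic is a sinusoid at an integer multiple of f0,
    all sharing a single overall amplitude envelope.
    """
    hop = 64  # upsample frame-level controls to audio rate
    n_samples = len(f0) * hop
    frames = np.arange(len(f0))
    f0_audio = np.interp(np.arange(n_samples) / hop, frames, f0)
    amp_audio = np.interp(np.arange(n_samples) / hop, frames, amplitude)

    # Integrate instantaneous frequency to get phase, then sum harmonics.
    phase = 2 * np.pi * np.cumsum(f0_audio) / sample_rate
    audio = np.zeros(n_samples)
    for k in range(1, n_harmonics + 1):
        harmonic_freq = k * f0_audio
        # Silence harmonics above Nyquist to avoid aliasing.
        alias_mask = harmonic_freq < sample_rate / 2
        audio += alias_mask * np.sin(k * phase) / k
    return amp_audio * audio / n_harmonics

# Example: a 440 Hz tone with a linear fade-in over 100 frames.
f0 = np.full(100, 440.0)
amp = np.linspace(0.0, 1.0, 100)
audio = harmonic_synth(f0, amp)
```

In the paper's setting, the f0 and amplitude curves are extracted from a singer's recording, which is what lets a small model trained on unprocessed audio reproduce both the melody and the voice.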
Related papers
- Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment [56.019288564115136]
We propose a novel task, text-to-song synthesis, which incorporates both vocal and accompaniment generation.
We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis.
Evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency.
arXiv Detail & Related papers (2024-04-14T18:00:05Z) - WikiMuTe: A web-sourced dataset of semantic descriptions for music audio [7.4327407361824935]
We present WikiMuTe, a new and open dataset containing rich semantic descriptions of music.
The data is sourced from Wikipedia's rich catalogue of articles covering musical works.
We train a model that jointly learns text and audio representations and performs cross-modal retrieval.
arXiv Detail & Related papers (2023-12-14T18:38:02Z) - MAP-Music2Vec: A Simple and Effective Baseline for Self-Supervised Music Audio Representation Learning [41.633972123961094]
Music2Vec is a framework exploring different SSL algorithmic components and tricks for music audio recordings.
Our model achieves comparable results to the state-of-the-art (SOTA) music SSL model Jukebox, despite being significantly smaller with less than 2% of parameters of the latter.
arXiv Detail & Related papers (2022-12-05T16:04:26Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - Sound and Visual Representation Learning with Multiple Pretraining Tasks [104.11800812671953]
Self-supervised learning (SSL) tasks reveal different features from the data.
This work aims to combine multiple SSL tasks (Multi-SSL) into a representation that generalizes well across all downstream tasks.
Experiments on sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single SSL task models.
arXiv Detail & Related papers (2022-01-04T09:09:38Z) - Real-time Timbre Transfer and Sound Synthesis using DDSP [1.7942265700058984]
We present a real-time implementation of the DDSP library embedded in a virtual synthesizer as a plug-in.
We focused on timbre transfer from learned representations of real instruments to arbitrary sound inputs as well as controlling these models by MIDI.
We developed a GUI for intuitive high-level controls which can be used for post-processing and manipulating the parameters estimated by the neural network.
arXiv Detail & Related papers (2021-03-12T11:49:51Z) - Anyone GAN Sing [0.0]
We present a method to synthesize the singing voice of a person using a Convolutional Long Short-term Memory (ConvLSTM) based GAN.
Our work is inspired by WGANSing by Chandna et al.
arXiv Detail & Related papers (2021-02-22T14:30:58Z) - Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
arXiv Detail & Related papers (2020-10-19T13:05:08Z) - dMelodies: A Music Dataset for Disentanglement Learning [70.90415511736089]
We present a new symbolic music dataset that will help researchers demonstrate the efficacy of their algorithms on diverse domains.
This will also provide a means for evaluating algorithms specifically designed for music.
The dataset is large enough (approx. 1.3 million data points) to train and test deep networks for disentanglement learning.
arXiv Detail & Related papers (2020-07-29T19:20:07Z) - DeepSinger: Singing Voice Synthesis with Data Mined From the Web [194.10598657846145]
DeepSinger is a multi-lingual singing voice synthesis system built from scratch using singing training data mined from music websites.
We evaluate DeepSinger on our mined singing dataset, which consists of about 92 hours of data from 89 singers in three languages.
arXiv Detail & Related papers (2020-07-09T07:00:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.