Speech-to-Singing Conversion in an Encoder-Decoder Framework
- URL: http://arxiv.org/abs/2002.06595v1
- Date: Sun, 16 Feb 2020 15:33:41 GMT
- Title: Speech-to-Singing Conversion in an Encoder-Decoder Framework
- Authors: Jayneel Parekh, Preeti Rao, Yi-Hsuan Yang
- Abstract summary: We take a learning based approach to the problem of converting spoken lines into sung ones.
We learn encodings that enable us to synthesize singing that preserves the linguistic content and timbre of the speaker.
- Score: 38.111942306157545
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper our goal is to convert a set of spoken lines into sung ones.
Unlike previous signal processing based methods, we take a learning based
approach to the problem. This allows us to automatically model various aspects
of this transformation, thus overcoming dependence on specific inputs such as
high quality singing templates or phoneme-score synchronization information.
Specifically, we propose an encoder-decoder framework for our task. Given
time-frequency representations of speech and a target melody contour, we learn
encodings that enable us to synthesize singing that preserves the linguistic
content and timbre of the speaker while adhering to the target melody. We also
propose a multi-task learning based objective to improve lyric intelligibility.
We present a quantitative and qualitative analysis of our framework.
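The abstract describes two encoders (one for the speech spectrogram, one for the target melody contour) whose outputs are combined and decoded into a singing spectrogram, with an auxiliary multi-task objective for lyric intelligibility. A minimal sketch of that idea follows; the module choices, dimensions, and the phoneme-prediction auxiliary head are illustrative assumptions, not the authors' exact architecture:

```python
# Hedged sketch of an encoder-decoder speech-to-singing model:
# encode speech and melody (F0) separately, concatenate frame-wise,
# decode a singing spectrogram, and expose an auxiliary phoneme head
# for a multi-task lyric-intelligibility objective.
import torch
import torch.nn as nn

N_MELS = 80      # assumed spectrogram size
MELODY_DIM = 1   # one F0 value per frame
HIDDEN = 128
N_PHONEMES = 40  # assumed phoneme vocabulary for the auxiliary task

class SpeechToSinging(nn.Module):
    def __init__(self):
        super().__init__()
        # Speech encoder: captures linguistic content and speaker timbre.
        self.speech_enc = nn.GRU(N_MELS, HIDDEN,
                                 batch_first=True, bidirectional=True)
        # Melody encoder: captures the target pitch contour.
        self.melody_enc = nn.GRU(MELODY_DIM, HIDDEN,
                                 batch_first=True, bidirectional=True)
        # Decoder maps the joint encoding back to a spectrogram.
        self.decoder = nn.GRU(4 * HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, N_MELS)
        # Auxiliary head: frame-wise phoneme logits (multi-task objective).
        self.phoneme_head = nn.Linear(HIDDEN, N_PHONEMES)

    def forward(self, speech_spec, melody_f0):
        # speech_spec: (batch, frames, N_MELS); melody_f0: (batch, frames, 1)
        s, _ = self.speech_enc(speech_spec)           # (B, T, 2*HIDDEN)
        m, _ = self.melody_enc(melody_f0)             # (B, T, 2*HIDDEN)
        h, _ = self.decoder(torch.cat([s, m], dim=-1))
        return self.out(h), self.phoneme_head(h)

model = SpeechToSinging()
spec = torch.randn(2, 100, N_MELS)      # dummy speech spectrogram
f0 = torch.randn(2, 100, MELODY_DIM)    # dummy target melody contour
sung, phoneme_logits = model(spec, f0)
print(sung.shape)            # torch.Size([2, 100, 80])
print(phoneme_logits.shape)  # torch.Size([2, 100, 40])
```

In training, the spectrogram output would be matched against a sung target (e.g. an L1/L2 loss) while the phoneme head would be supervised with aligned phoneme labels, the weighted sum forming the multi-task objective the abstract mentions.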
Related papers
- Discrete Unit based Masking for Improving Disentanglement in Voice Conversion [8.337649176647645]
We introduce a novel masking mechanism in the input before speaker encoding, masking certain discrete speech units that correspond highly with phoneme classes.
Our approach improves disentanglement and conversion performance across multiple VC methods, with 44% relative improvement in objective intelligibility.
arXiv Detail & Related papers (2024-09-17T21:17:59Z)
- Singer Identity Representation Learning using Self-Supervised Techniques [0.0]
We propose a framework for training singer identity encoders to extract representations suitable for various singing-related tasks.
We explore different self-supervised learning techniques on a large collection of isolated vocal tracks.
We evaluate the quality of the resulting representations on singer similarity and identification tasks.
arXiv Detail & Related papers (2024-01-10T10:41:38Z)
- Towards General-Purpose Text-Instruction-Guided Voice Conversion [84.78206348045428]
This paper introduces a novel voice conversion model, guided by text instructions such as "articulate slowly with a deep tone" or "speak in a cheerful boyish voice".
The proposed VC model is a neural language model which processes a sequence of discrete codes, resulting in the code sequence of converted speech.
arXiv Detail & Related papers (2023-09-25T17:52:09Z)
- Disentangled Feature Learning for Real-Time Neural Speech Coding [24.751813940000993]
In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding.
We find that the learned disentangled features show comparable performance on any-to-any voice conversion with modern self-supervised speech representation learning models.
arXiv Detail & Related papers (2022-11-22T02:50:12Z)
- AudioLM: a Language Modeling Approach to Audio Generation [59.19364975706805]
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency.
We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure.
We demonstrate how our approach extends beyond speech by generating coherent piano music continuations.
arXiv Detail & Related papers (2022-09-07T13:40:08Z)
- VAW-GAN for Singing Voice Conversion with Non-parallel Training Data [81.79070894458322]
We propose a singing voice conversion framework based on VAW-GAN.
We train an encoder to disentangle singer identity and singing prosody (F0) from phonetic content.
By conditioning on singer identity and F0, the decoder generates output spectral features with unseen target singer identity.
arXiv Detail & Related papers (2020-08-10T09:44:10Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method utilizes an acoustic model, trained for the task of automatic speech recognition, together with melody-extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- Speech-to-Singing Conversion based on Boundary Equilibrium GAN [42.739822506085694]
This paper investigates the use of generative adversarial network (GAN)-based models for converting the spectrogram of a speech signal into that of a singing one.
The proposed model generates singing voices with much higher naturalness than an existing non-adversarially trained baseline.
arXiv Detail & Related papers (2020-05-28T08:18:02Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.