Towards end-to-end F0 voice conversion based on Dual-GAN with
convolutional wavelet kernels
- URL: http://arxiv.org/abs/2104.07283v1
- Date: Thu, 15 Apr 2021 07:42:59 GMT
- Title: Towards end-to-end F0 voice conversion based on Dual-GAN with
convolutional wavelet kernels
- Authors: Cl\'ement Le Moine Veillon, Nicolas Obin and Axel Roebel
- Abstract summary: A single neural network is proposed, in which a first module is used to learn F0 representation over different temporal scales.
A second adversarial module is used to learn the transformation from one emotion to another.
- Score: 11.92436948211501
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper presents a end-to-end framework for the F0 transformation in the
context of expressive voice conversion. A single neural network is proposed, in
which a first module is used to learn F0 representation over different temporal
scales and a second adversarial module is used to learn the transformation from
one emotion to another. The first module is composed of a convolution layer
with wavelet kernels so that the various temporal scales of F0 variations can
be efficiently encoded. The single decomposition/transformation network allows
to learn in a end-to-end manner the F0 decomposition that are optimal with
respect to the transformation, directly from the raw F0 signal.
Related papers
- SiFiSinger: A High-Fidelity End-to-End Singing Voice Synthesizer based on Source-filter Model [31.280358048556444]
This paper presents an advanced end-to-end singing voice synthesis (SVS) system based on the source-filter mechanism.
The proposed system also incorporates elements like the fundamental pitch (F0) predictor and waveform generation decoder.
Experiments on the Opencpop dataset demonstrate efficacy of the proposed model in intonation quality and accuracy.
arXiv Detail & Related papers (2024-10-16T13:18:45Z) - Transform Once: Efficient Operator Learning in Frequency Domain [69.74509540521397]
We study deep neural networks designed to harness the structure in frequency domain for efficient learning of long-range correlations in space or time.
This work introduces a blueprint for frequency domain learning through a single transform: transform once (T1)
arXiv Detail & Related papers (2022-11-26T01:56:05Z) - Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band
Generation and Inverse Short-Time Fourier Transform [9.606821628015933]
We propose a lightweight end-to-end text-to-speech model using multi-band generation and inverse short-time Fourier transform.
Experimental results show that our model synthesized speech as natural as that synthesized by VITS.
A smaller version of the model significantly outperformed a lightweight baseline model with respect to both naturalness and inference speed.
arXiv Detail & Related papers (2022-10-28T08:15:05Z) - Machine Learning model for gas-liquid interface reconstruction in CFD
numerical simulations [59.84561168501493]
The volume of fluid (VoF) method is widely used in multi-phase flow simulations to track and locate the interface between two immiscible fluids.
A major bottleneck of the VoF method is the interface reconstruction step due to its high computational cost and low accuracy on unstructured grids.
We propose a machine learning enhanced VoF method based on Graph Neural Networks (GNN) to accelerate the interface reconstruction on general unstructured meshes.
arXiv Detail & Related papers (2022-07-12T17:07:46Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z) - Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [25.234945748885348]
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.
The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop.
Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2020-11-06T19:30:07Z) - Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with
CycleGAN [81.79070894458322]
Cross-lingual voice conversion aims to change source speaker's voice to sound like that of target speaker, when source and target speakers speak different languages.
Previous studies on cross-lingual voice conversion mainly focus on spectral conversion with a linear transformation for F0 transfer.
We propose the use of continuous wavelet transform (CWT) decomposition for F0 modeling. CWT provides a way to decompose a signal into different temporal scales that explain prosody in different time resolutions.
arXiv Detail & Related papers (2020-08-11T07:29:55Z) - Multi-speaker Emotion Conversion via Latent Variable Regularization and
a Chained Encoder-Decoder-Predictor Network [18.275646344620387]
We propose a novel method for emotion conversion in speech based on a chained encoder-decoder-predictor neural network architecture.
We show that our method outperforms the existing state-of-the-art approaches on both, the saliency of emotion conversion and the quality of resynthesized speech.
arXiv Detail & Related papers (2020-07-25T13:59:22Z) - Volumetric Transformer Networks [88.85542905676712]
We introduce a learnable module, the volumetric transformer network (VTN)
VTN predicts channel-wise warping fields so as to reconfigure intermediate CNN features spatially and channel-wisely.
Our experiments show that VTN consistently boosts the features' representation power and consequently the networks' accuracy on fine-grained image recognition and instance-level image retrieval.
arXiv Detail & Related papers (2020-07-18T14:00:12Z) - Transforming Spectrum and Prosody for Emotional Voice Conversion with
Non-Parallel Training Data [91.92456020841438]
Many studies require parallel speech data between different emotional patterns, which is not practical in real life.
We propose a CycleGAN network to find an optimal pseudo pair from non-parallel training data.
We also study the use of continuous wavelet transform (CWT) to decompose F0 into ten temporal scales, that describes speech prosody at different time resolution.
arXiv Detail & Related papers (2020-02-01T12:36:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.