Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using
Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation
- URL: http://arxiv.org/abs/2204.10020v1
- Date: Thu, 21 Apr 2022 11:03:37 GMT
- Title: Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using
Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation
- Authors: Ryo Terashima, Ryuichi Yamamoto, Eunwoo Song, Yuma Shirahata,
Hyun-Wook Yoon, Jae-Min Kim, Kentaro Tachibana
- Abstract summary: We propose a novel data augmentation method that combines pitch-shifting and VC techniques.
Because pitch-shift data augmentation enables the coverage of a variety of pitch dynamics, it greatly stabilizes training for both VC and TTS models.
Subjective test results showed that a FastSpeech 2-based emotional TTS system with the proposed method improved naturalness and emotional similarity compared with conventional methods.
- Score: 19.807274303199755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data augmentation via voice conversion (VC) has been successfully applied to
low-resource expressive text-to-speech (TTS) when only neutral data for the
target speaker are available. Although the quality of VC is crucial for this
approach, it is challenging to learn a stable VC model because the amount of
data is limited in low-resource scenarios, and highly expressive speech has
large acoustic variety. To address this issue, we propose a novel data
augmentation method that combines pitch-shifting and VC techniques. Because
pitch-shift data augmentation enables the coverage of a variety of pitch
dynamics, it greatly stabilizes training for both VC and TTS models, even when
only 1,000 utterances of the target speaker's neutral data are available.
Subjective test results showed that a FastSpeech 2-based emotional TTS system
with the proposed method improved naturalness and emotional similarity compared
with conventional methods.
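The paper does not ship reference code, but the pitch-shifting step is easy to picture. Below is a minimal sketch in Python, assuming the librosa and soundfile libraries; the ±2-semitone sweep and file naming are illustrative choices, not the paper's reported settings.

```python
# Minimal sketch of pitch-shift data augmentation (assumed parameters).
# Each neutral utterance yields several pitch-shifted variants, widening
# the pitch coverage seen by the downstream VC and TTS models.
import librosa
import soundfile as sf

def augment_with_pitch_shifts(wav_path, out_prefix, semitone_steps=(-2, -1, 1, 2)):
    """Write pitch-shifted copies of one utterance next to the original."""
    y, sr = librosa.load(wav_path, sr=None)  # keep the native sampling rate
    for n_steps in semitone_steps:
        # Phase-vocoder pitch shift: changes pitch, preserves duration.
        y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
        sf.write(f"{out_prefix}_shift{n_steps:+d}.wav", y_shifted, sr)
```

Applied to the 1,000 neutral utterances mentioned in the abstract, such a sweep multiplies the effective training data while exposing the VC and TTS models to a wider range of pitch dynamics.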
Related papers
- Multi-modal Adversarial Training for Zero-Shot Voice Cloning [9.823246184635103]
We propose a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features.
We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset.
Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-08-28T16:30:41Z)
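As a loose sketch of conditional discrimination between real and generated speech features (a generic PyTorch toy with an MLP scorer, not the paper's Transformer encoder-decoder discriminator; all dimensions are assumptions):

```python
# Toy conditional discriminator: scores mel frames as real/generated,
# conditioned on a speaker embedding (dims are illustrative assumptions).
import torch
import torch.nn as nn

class CondDiscriminator(nn.Module):
    def __init__(self, mel_dim=80, cond_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim + cond_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),  # per-frame real/fake logit
        )

    def forward(self, mel, cond):
        # Broadcast the conditioning vector across frames and score each one.
        cond = cond.unsqueeze(1).expand(-1, mel.size(1), -1)
        return self.net(torch.cat([mel, cond], dim=-1))
```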
- Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech [6.243356997302935]
We introduce a framework for cross-lingual speech synthesis, which involves an upstream Voice Conversion (VC) model and a downstream Text-To-Speech (TTS) model.
In the first two stages, we use a VC model to convert utterances in the target locale to the voice of the target speaker.
In the third stage, the converted data is combined with the linguistic features and durations from recordings in the target language, which are then used to train a single-speaker acoustic model.
arXiv Detail & Related papers (2023-09-15T09:03:14Z)
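As a purely illustrative outline of the staged pipeline in the entry above (the `Utterance` container, `vc_model.convert`, and `build_training_set` are hypothetical stand-ins, not the paper's interfaces):

```python
# Hypothetical outline of the VC -> TTS training-data flow described above.
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Utterance:
    audio: Any        # waveform in the target locale
    linguistic: Any   # phoneme/linguistic features
    durations: Any    # per-phoneme durations

def build_training_set(vc_model, locale_corpus: List[Utterance]):
    examples = []
    for utt in locale_corpus:
        # Stages 1-2: convert the utterance to the target speaker's voice.
        converted = vc_model.convert(utt.audio)
        # Stage 3: pair converted audio with the linguistic features and
        # durations of the original recording; the result trains a
        # single-speaker acoustic model.
        examples.append((converted, utt.linguistic, utt.durations))
    return examples
```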
- Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454]
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and timbre units.
arXiv Detail & Related papers (2023-09-14T09:52:08Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
Zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder.
On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation (speaker verification on the speaker and content embeddings) and subjective evaluation (voice naturalness and similarity), and the method remains robust even with noisy source/target utterances.
arXiv Detail & Related papers (2022-03-30T23:03:19Z)
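A minimal PyTorch sketch of that conversion step follows; the GRU decoder, dimensions, and random inputs are assumptions for illustration, not the paper's architecture.

```python
# Toy decoder conditioned on a speaker embedding plus per-frame content
# embeddings, as in VAE-based zero-shot voice conversion (assumed dims).
import torch
import torch.nn as nn

class VCDecoder(nn.Module):
    def __init__(self, content_dim=64, speaker_dim=128, mel_dim=80):
        super().__init__()
        self.rnn = nn.GRU(content_dim + speaker_dim, 256, batch_first=True)
        self.out = nn.Linear(256, mel_dim)

    def forward(self, content, speaker):
        # Broadcast one speaker embedding across all frames, concatenate
        # with the content embeddings, and decode to a mel-spectrogram.
        spk = speaker.unsqueeze(1).expand(-1, content.size(1), -1)
        h, _ = self.rnn(torch.cat([content, spk], dim=-1))
        return self.out(h)

decoder = VCDecoder()
content = torch.randn(1, 100, 64)  # content embeddings from source speech
speaker = torch.randn(1, 128)      # embedding of an unseen target speaker
mel = decoder(content, speaker)    # converted mel, shape (1, 100, 80)
```

Swapping the speaker embedding while holding the content embeddings fixed is what makes the conversion zero-shot.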
- Cross-speaker style transfer for text-to-speech using data augmentation [11.686745250628247]
We address the problem of cross-speaker style transfer for text-to-speech (TTS) using data augmentation via voice conversion.
We assume access to a corpus of neutral, non-expressive data from a target speaker and supporting conversational expressive data from different speakers.
We conclude by scaling our proposed technology to a set of 14 speakers across 7 languages.
arXiv Detail & Related papers (2022-02-10T15:10:56Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results show that the proposed method learns more effective disentangled speech representations than competing approaches.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
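For intuition, here is a toy PyTorch sketch of the VQ content-encoding step; the codebook size and dimensions are generic choices, and the mutual-information regularizer is omitted.

```python
# Toy vector quantization of per-frame content features (assumed sizes).
import torch

def vector_quantize(z, codebook):
    """Snap each frame embedding to its nearest codebook entry."""
    dists = torch.cdist(z, codebook)   # (frames, K) pairwise distances
    idx = dists.argmin(dim=1)          # index of nearest code per frame
    z_q = codebook[idx]                # quantized content representation
    # Straight-through estimator: gradients flow back to the encoder.
    return z + (z_q - z).detach(), idx

codebook = torch.randn(512, 64)  # 512 codes, 64-dim each
z = torch.randn(100, 64)         # encoder outputs for 100 frames
z_q, codes = vector_quantize(z, codebook)
```

Quantization discards fine-grained detail from the content path; the MI term in the paper further discourages leakage between the learned factors.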
- Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments [76.98764900754111]
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
We propose Voicy, a new VC framework particularly tailored for noisy speech.
Our method, inspired by the denoising autoencoder framework, comprises four encoders (speaker, content, phonetic, and acoustic-ASR) and one decoder.
arXiv Detail & Related papers (2021-06-16T15:47:06Z)
- AdaSpeech: Adaptive Text to Speech for Custom Voice [104.69219752194863]
We propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices.
Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker.
arXiv Detail & Related papers (2021-03-01T13:28:59Z)
- Low-resource expressive text-to-speech using data augmentation [12.396086122947679]
We present a novel 3-step methodology to circumvent the costly operation of recording large amounts of target data.
We augment data via voice conversion by leveraging recordings in the desired speaking style from other speakers.
Next, we use that synthetic data on top of the available recordings to train a TTS model.
arXiv Detail & Related papers (2020-11-11T11:22:37Z)
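A hypothetical sketch of how such VC-augmented data could be pooled with the target speaker's real recordings (`vc_convert` and the tuple layout are assumed stand-ins, not the paper's API):

```python
# Assemble a TTS training corpus from real neutral data plus expressive
# data voice-converted from supporting speakers (hypothetical interfaces).
def build_tts_corpus(target_neutral, supporting_expressive, vc_convert):
    synthetic = [(vc_convert(wav), text, style)  # converted to target voice
                 for wav, text, style in supporting_expressive]
    real = [(wav, text, "neutral") for wav, text in target_neutral]
    return synthetic + real  # mixed corpus for TTS training
```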
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)