Zero-shot text-to-speech synthesis conditioned using self-supervised
speech representation model
- URL: http://arxiv.org/abs/2304.11976v1
- Date: Mon, 24 Apr 2023 10:15:58 GMT
- Title: Zero-shot text-to-speech synthesis conditioned using self-supervised
speech representation model
- Authors: Kenichi Fujita, Takanori Ashihara, Hiroki Kanagawa, Takafumi Moriya,
Yusuke Ijima
- Abstract summary: A novel point of the proposed method is the direct use of
an SSL model to obtain embedding vectors from speech representations trained
on a large amount of data. The disentangled embeddings enable better
reproduction of unseen speakers and rhythm transfer conditioned on different
utterances.
- Score: 13.572330725278066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a zero-shot text-to-speech (TTS) method
conditioned on a speech-representation model acquired through self-supervised
learning (SSL). Conventional methods with embedding vectors from x-vectors or
global style tokens still leave a gap in reproducing the characteristics of
unseen speakers. A novel point of the proposed method is the direct use of the
SSL model to obtain embedding vectors from speech representations trained on a
large amount of data. We also introduce separate conditioning of the acoustic
features and of a phoneme duration predictor to disentangle rhythm-based
speaker characteristics from acoustic-feature-based ones. These disentangled
embeddings enable better reproduction of unseen speakers and rhythm transfer
conditioned on a different utterance. Objective and subjective evaluations
showed that the proposed method synthesizes speech with improved speaker
similarity and achieves speech-rhythm transfer.
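As a rough illustration of the separate conditioning described above, here is a minimal PyTorch sketch that pools frozen-SSL features into two embeddings, one fed to the acoustic-feature decoder and one to the phoneme duration predictor. All module names and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (PyTorch) of disentangled conditioning from SSL features.
import torch
import torch.nn as nn

class DisentangledConditioner(nn.Module):
    """Pools SSL-model features into two separate speaker embeddings:
    one conditioning the acoustic-feature decoder, one conditioning
    the phoneme duration predictor."""
    def __init__(self, ssl_dim=768, emb_dim=256):
        super().__init__()
        # Separate projection heads keep rhythm-based and
        # acoustic-feature-based speaker traits disentangled.
        self.acoustic_head = nn.Linear(ssl_dim, emb_dim)
        self.duration_head = nn.Linear(ssl_dim, emb_dim)

    def forward(self, ssl_feats):                  # (batch, frames, ssl_dim)
        pooled = ssl_feats.mean(dim=1)             # temporal average pooling
        emb_acoustic = self.acoustic_head(pooled)  # for the decoder
        emb_duration = self.duration_head(pooled)  # for the duration predictor
        return emb_acoustic, emb_duration

# ssl_feats would come from a frozen SSL model (e.g. HuBERT / wav2vec 2.0);
# here dummy reference-speech features stand in.
feats = torch.randn(2, 320, 768)
emb_a, emb_d = DisentangledConditioner()(feats)
print(emb_a.shape, emb_d.shape)                    # torch.Size([2, 256]) twice
```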
Related papers
- Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and
Phoneme Duration for Multi-Speaker Speech Synthesis [16.497022070614236]
This paper proposes a speech-rhythm-based method for speaker embeddings that models phoneme duration from a few utterances by the target speaker.
A novel feature of the method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to relate to speaking rhythm.
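A minimal sketch of this rhythm-embedding idea, summarizing a (phoneme, duration) sequence into a single vector; names and sizes are assumptions rather than the paper's code:

```python
# Minimal sketch (PyTorch): rhythm embedding from phonemes and durations.
import torch
import torch.nn as nn

class RhythmEmbedder(nn.Module):
    def __init__(self, n_phonemes=70, emb_dim=128):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, emb_dim)
        # A GRU summarizes the (phoneme, duration) sequence into one vector.
        self.rnn = nn.GRU(emb_dim + 1, emb_dim, batch_first=True)

    def forward(self, phoneme_ids, durations):
        # phoneme_ids: (batch, T) int64; durations: (batch, T) in seconds
        x = torch.cat([self.phoneme_emb(phoneme_ids),
                       durations.unsqueeze(-1)], dim=-1)
        _, h = self.rnn(x)                # h: (1, batch, emb_dim)
        return h.squeeze(0)               # rhythm-based speaker embedding

ids = torch.randint(0, 70, (2, 15))
durs = torch.rand(2, 15) * 0.3            # dummy per-phoneme durations
print(RhythmEmbedder()(ids, durs).shape)  # torch.Size([2, 128])
```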
arXiv Detail & Related papers (2024-02-11T02:26:43Z)
- Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters [47.75276947690528]
Zero-shot text-to-speech (TTS) methods can reproduce speaker characteristics very accurately.
However, they suffer from degraded synthesis quality when the reference speech contains noise.
In this paper, we propose a noise-robust zero-shot TTS method.
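The adapter itself is typically a small residual bottleneck trained on noisy data while the surrounding SSL layers stay frozen; a minimal sketch under that assumption (sizes and placement are illustrative):

```python
# Minimal sketch (PyTorch) of a bottleneck adapter inserted into an SSL model.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small residual bottleneck trained for noise robustness while the
    surrounding SSL layers stay frozen."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x):                            # x: (batch, frames, dim)
        return x + self.up(self.act(self.down(x)))   # residual connection

h = torch.randn(2, 100, 768)   # hidden states of one frozen SSL layer
print(Adapter()(h).shape)       # torch.Size([2, 100, 768])
```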
arXiv Detail & Related papers (2024-01-10T12:21:21Z)
- SelfVC: Voice Conversion With Iterative Refinement using Self Transformations [42.97689861071184]
SelfVC is a training strategy to improve a voice conversion model with self-synthesized examples.
We develop techniques to derive prosodic information from the audio signal and SSL representations to train predictive submodules in the synthesis model.
Our framework is trained without any text and achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating naturalness, speaker similarity, and intelligibility of synthesized audio.
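As one concrete way to derive prosodic information from the audio signal, the sketch below extracts log-F0 with librosa's pyin; the choice of extractor is an assumption, not necessarily SelfVC's.

```python
# Minimal sketch: derive a prosodic feature (log-F0) from raw audio.
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)  # bundled demo audio
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
log_f0 = np.log(f0)   # f0 is NaN on unvoiced frames, so log stays NaN there
print(np.nanmean(log_f0), np.nanstd(log_f0))  # simple prosody statistics
```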
arXiv Detail & Related papers (2023-10-14T19:51:17Z)
- High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units [69.06657692891447]
We propose a novel AVO method leveraging the learning objective of self-supervised discrete speech unit prediction.
Experimental results show that our proposed method achieves remarkable lip-speech synchronization and high speech quality.
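A minimal sketch of such a unit-prediction objective: frame-level decoder states are trained to predict discrete units obtained by clustering SSL features (e.g., k-means over HuBERT). The vocabulary size and names are assumptions.

```python
# Minimal sketch (PyTorch) of a discrete-speech-unit prediction loss.
import torch
import torch.nn as nn

N_UNITS, HID = 100, 256               # assumed unit vocabulary / model width
unit_head = nn.Linear(HID, N_UNITS)   # predicts one unit per frame

hidden = torch.randn(2, 120, HID)                    # decoder states
target_units = torch.randint(0, N_UNITS, (2, 120))   # k-means unit labels

logits = unit_head(hidden)                            # (batch, frames, N_UNITS)
loss = nn.functional.cross_entropy(logits.transpose(1, 2), target_units)
print(loss.item())                                    # auxiliary supervision
```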
arXiv Detail & Related papers (2023-06-29T15:02:22Z)
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
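A minimal sketch of the speaker-invariant clustering idea: two views of the same utterance, original and speaker-perturbed, should fall into the same learned clusters. The swapped hard-target loss below is a simplified stand-in for Spin's actual objective.

```python
# Minimal sketch (PyTorch) of speaker-invariant cluster assignment.
import torch
import torch.nn as nn
import torch.nn.functional as F

codebook = nn.Linear(256, 512, bias=False)   # 512 learnable cluster centers

def cluster_logits(feats):                    # feats: (frames, 256)
    return codebook(F.normalize(feats, dim=-1))

z_orig = torch.randn(100, 256)   # features of the original audio
z_pert = torch.randn(100, 256)   # features of speaker-perturbed audio

# Targets come from one view, predictions from the other, and vice versa,
# so the clusters must ignore speaker information to fit both views.
with torch.no_grad():
    t_orig = cluster_logits(z_orig).argmax(-1)
    t_pert = cluster_logits(z_pert).argmax(-1)
loss = (F.cross_entropy(cluster_logits(z_pert), t_orig) +
        F.cross_entropy(cluster_logits(z_orig), t_pert)) / 2
print(loss.item())
```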
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
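A minimal sketch of that discretization step, using scikit-learn's k-means over dummy phoneme-level F0 and duration features; the scaling choices are assumptions:

```python
# Minimal sketch: unsupervised prosodic clustering with k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# One row per phoneme: [mean F0 in Hz, duration in seconds] (dummy data).
feats = np.column_stack([rng.normal(180, 40, 1000),
                         rng.gamma(2.0, 0.05, 1000)])
feats_z = (feats - feats.mean(0)) / feats.std(0)   # normalize both dimensions

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(feats_z)
print(np.bincount(labels))  # each phoneme now carries a discrete prosody label
```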
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
- ESSumm: Extractive Speech Summarization from Untranscribed Meeting [7.309214379395552]
We propose a novel architecture for direct extractive speech-to-speech summarization, ESSumm.
We leverage an off-the-shelf self-supervised convolutional neural network to extract deep speech features from raw audio.
Our approach automatically predicts the optimal sequence of speech segments that capture the key information with a target summary length.
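A minimal sketch of the selection step, with greedy score-based picking as an illustrative stand-in for the paper's predicted optimal sequence:

```python
# Minimal sketch: pick speech segments with the highest importance scores
# until the target summary length is reached (pure Python, dummy values).
segments = [                     # (start_s, end_s, importance)
    (0.0, 8.0, 0.9), (8.0, 15.0, 0.2), (15.0, 24.0, 0.7), (24.0, 30.0, 0.5)]
target_len = 18.0                # desired summary duration in seconds

chosen, total = [], 0.0
for start, end, score in sorted(segments, key=lambda s: -s[2]):
    if total + (end - start) <= target_len:
        chosen.append((start, end))
        total += end - start
chosen.sort()                    # restore chronological order
print(chosen, total)             # segments forming the speech summary
```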
arXiv Detail & Related papers (2022-09-14T20:13:15Z)
- Using multiple reference audios and style embedding constraints for speech synthesis [68.62945852651383]
The proposed model improves speech naturalness and content quality by using multiple reference audios.
It also outperforms the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results demonstrate the superiority of the proposed method in learning effective disentangled speech representations.
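A minimal sketch of the vector-quantization step for content encoding; the mutual-information term is only noted in a comment, since its estimator (e.g., the one used in VQMIVC) is more involved:

```python
# Minimal sketch (PyTorch): nearest-neighbor vector quantization with a
# straight-through gradient estimator.
import torch

codebook = torch.randn(64, 128)           # 64 learnable content codes

def quantize(z):                          # z: (frames, 128) encoder output
    d = torch.cdist(z, codebook)          # distance to every code
    idx = d.argmin(dim=-1)                # nearest-code assignment
    z_q = codebook[idx]
    # Straight-through estimator lets gradients bypass the argmin.
    return z + (z_q - z).detach(), idx

z = torch.randn(100, 128)
z_q, idx = quantize(z)
print(z_q.shape, idx[:10])
# Training would add a mutual-information penalty between content, speaker,
# and pitch representations so they stay statistically independent.
```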
arXiv Detail & Related papers (2021-06-18T13:50:38Z)