Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders
- URL: http://arxiv.org/abs/2210.16045v1
- Date: Fri, 28 Oct 2022 10:31:44 GMT
- Title: Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders
- Authors: Jason Fong, Yun Wang, Prabhav Agrawal, Vimal Manohar, Jilong Wu, Thilo Köhler, Qing He
- Abstract summary: Text-based voice editing (TBVE) uses synthetic output from text-to-speech (TTS) systems to replace words in an original recording.
Recent work has used neural models to produce edited speech similar to the original speech in terms of clarity, speaker identity, and prosody.
This work focuses on the zero-shot approach which avoids finetuning altogether.
- Score: 14.723225542605105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-based voice editing (TBVE) uses synthetic output from text-to-speech
(TTS) systems to replace words in an original recording. Recent work has used
neural models to produce edited speech that is similar to the original speech
in terms of clarity, speaker identity, and prosody. However, one limitation of
prior work is its reliance on finetuning to optimise performance: this requires
further model training on data from the target speaker, a costly process that
may incorporate potentially sensitive data into server-side models. In
contrast, this work focuses on the zero-shot approach, which avoids
finetuning altogether, and instead uses pretrained speaker verification
embeddings together with a jointly trained reference encoder to encode
utterance-level information that helps capture aspects such as speaker identity
and prosody. Subjective listening tests find that both utterance embeddings and
a reference encoder improve the continuity of speaker identity and prosody
between the edited synthetic speech and unedited original recording in the
zero-shot setting.
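To make this concrete, here is a minimal PyTorch sketch of the conditioning scheme the abstract describes, assuming illustrative module names, dimensions, and a GRU-based reference encoder (none of these details come from the paper): a frozen, pretrained speaker-verification embedding and a jointly trained reference encoder each summarise the original recording at the utterance level, and their concatenation is broadcast onto the phoneme encoder states that condition the TTS decoder.

import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Jointly trained encoder that compresses a reference mel-spectrogram
    into one utterance-level vector capturing speaker identity and prosody.
    (GRU choice and sizes are assumptions for illustration.)"""
    def __init__(self, n_mels: int = 80, dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, frames, n_mels) -> (batch, dim), final hidden state
        _, h = self.rnn(mels)
        return h[-1]

class ConditionedTextEncoder(nn.Module):
    """Concatenates the utterance-level conditioning onto every phoneme state
    before the states are passed to the TTS decoder."""
    def __init__(self, n_phones: int = 100, text_dim: int = 256,
                 spk_dim: int = 192, ref_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(n_phones, text_dim)
        self.proj = nn.Linear(text_dim + spk_dim + ref_dim, text_dim)

    def forward(self, phones, spk_emb, ref_emb):
        x = self.embed(phones)                          # (batch, T, text_dim)
        cond = torch.cat([spk_emb, ref_emb], dim=-1)    # (batch, spk+ref)
        cond = cond.unsqueeze(1).expand(-1, x.size(1), -1)
        return self.proj(torch.cat([x, cond], dim=-1))  # (batch, T, text_dim)

# Usage with random stand-ins: in practice spk_emb would come from a frozen
# pretrained speaker-verification model (e.g. a d-vector extractor).
ref_encoder = ReferenceEncoder()
text_encoder = ConditionedTextEncoder()
mels = torch.randn(2, 400, 80)           # original recording as mel frames
spk_emb = torch.randn(2, 192)            # frozen speaker-verification embedding
phones = torch.randint(0, 100, (2, 50))  # phoneme ids of the edited text
states = text_encoder(phones, spk_emb, ref_encoder(mels))
print(states.shape)                      # torch.Size([2, 50, 256])

Note that nothing in this sketch is finetuned on the target speaker: the speaker-verification extractor stays frozen and the reference encoder is trained jointly with the TTS model on multi-speaker data, which is what makes the approach zero-shot.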
Related papers
- FluentEditor+: Text-based Speech Editing by Modeling Local Hierarchical Acoustic Smoothness and Global Prosody Consistency [40.95700389032375]
Text-based speech editing (TSE) allows users to modify speech by editing the corresponding text and performing operations such as cutting, copying, and pasting.
Current TSE techniques focus on minimizing discrepancies between generated speech and reference targets within edited segments.
However, seamlessly integrating edited segments with the unaltered portions of the audio remains challenging.
This paper introduces a novel approach, FluentEditor+, designed to overcome these limitations.
arXiv Detail & Related papers (2024-09-28T10:18:35Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units [69.06657692891447]
We propose a novel AVO method leveraging the learning objective of self-supervised discrete speech unit prediction.
Experimental results show that our proposed method achieves remarkable lip-speech synchronization and high speech quality.
arXiv Detail & Related papers (2023-06-29T15:02:22Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the self-supervised learning (SSL) model to obtain embedding vectors from speech representations trained on a large amount of data.
The disentangled embeddings enable better reproduction performance for unseen speakers and rhythm transfer conditioned on different utterances.
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task (a toy sketch of this induction step appears after this list).
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Using multiple reference audios and style embedding constraints for speech synthesis [68.62945852651383]
The proposed model can improve the speech naturalness and content quality with multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR).
The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z)
- NAUTILUS: a Versatile Voice Cloning System [44.700803634034486]
NAUTILUS can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker.
It can clone unseen voices from untranscribed speech of target speakers via backpropagation-based adaptation.
It achieves comparable quality with state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech.
arXiv Detail & Related papers (2020-05-22T05:00:20Z)
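As a companion illustration for the Wav2Seq entry above, the following toy sketch shows one way a pseudo language can be induced, assuming k-means clustering of self-supervised frame features followed by collapsing consecutive duplicate cluster ids; the published method additionally applies subword encoding, and the random features here are a stand-in for a real SSL feature extractor.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 64))  # stand-in for SSL frame features

# Learn a discrete codebook over frame-level features.
kmeans = KMeans(n_clusters=25, n_init=10, random_state=0).fit(features)

def to_pseudo_tokens(utterance_feats: np.ndarray) -> list[int]:
    """Map frames to cluster ids, then collapse runs (3 3 3 7 7 1 -> 3 7 1),
    yielding the compact discrete sequence used as a pseudo ASR target."""
    ids = kmeans.predict(utterance_feats)
    return [int(t) for i, t in enumerate(ids) if i == 0 or t != ids[i - 1]]

print(to_pseudo_tokens(features[:50]))

Collapsing repeated ids is what makes the representation compact: steady speech sounds span many frames, so deduplication shortens the target sequence the encoder-decoder model must predict.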
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.