FluentEditor2: Text-based Speech Editing by Modeling Multi-Scale Acoustic and Prosody Consistency
- URL: http://arxiv.org/abs/2410.03719v2
- Date: Sun, 08 Dec 2024 11:50:03 GMT
- Title: FluentEditor2: Text-based Speech Editing by Modeling Multi-Scale Acoustic and Prosody Consistency
- Authors: Rui Liu, Jiatian Xi, Ziyue Jiang, Haizhou Li
- Abstract summary: Text-based speech editing (TSE) allows users to edit speech by modifying the corresponding text directly without altering the original recording.
Current TSE techniques often focus on minimizing discrepancies between generated speech and reference within edited regions during training to achieve fluent TSE performance.
We propose a new fluency speech editing scheme based on our previous FluentEditor model, termed FluentEditor2.
- Score: 40.95700389032375
- License:
- Abstract: Text-based speech editing (TSE) allows users to edit speech by modifying the corresponding text directly without altering the original recording. Current TSE techniques often focus on minimizing discrepancies between generated speech and reference within edited regions during training to achieve fluent TSE performance. However, the generated speech in the edited region should maintain acoustic and prosodic consistency with the unedited region and the original speech at both the local and global levels. To maintain speech fluency, we propose a new fluency speech editing scheme based on our previous FluentEditor model, termed FluentEditor2, by modeling the multi-scale acoustic and prosody consistency training criterion in TSE training. Specifically, for local acoustic consistency, we propose a hierarchical local acoustic smoothness constraint to align the acoustic properties of speech frames, phonemes, and words at the boundary between the generated speech in the edited region and the speech in the unedited region. For global prosody consistency, we propose a contrastive global prosody consistency constraint to keep the speech in the edited region consistent with the prosody of the original utterance. Extensive experiments on the VCTK and LibriTTS datasets show that FluentEditor2 surpasses existing neural network-based TSE methods, including EditSpeech, CampNet, A³T, FluentSpeech, and our FluentEditor, in both subjective and objective evaluations. Ablation studies further highlight the contributions of each module to the overall effectiveness of the system. Speech demos are available at: https://github.com/Ai-S2-Lab/FluentEditor2
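The two constraints named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed window sizes standing in for frame, phoneme, and word spans, and the InfoNCE-style form of the contrastive term are all assumptions made for illustration. A real system would presumably take phoneme and word boundaries from a forced alignment and prosody embeddings from a learned encoder.

```python
import numpy as np

def boundary_smoothness_loss(mel, start, end, scales=(1, 4, 16)):
    """Mean squared gap between averaged features on both sides of each edit boundary.

    mel: array of shape (T, n_mels); the edited region is frames [start, end).
    scales: hypothetical window sizes (in frames) standing in for
            frame-, phoneme-, and word-level spans.
    """
    losses = []
    for w in scales:
        for b in (start, end):
            left = mel[max(b - w, 0):b]
            right = mel[b:min(b + w, len(mel))]
            if len(left) == 0 or len(right) == 0:
                continue  # boundary sits at the utterance edge
            gap = left.mean(axis=0) - right.mean(axis=0)
            losses.append(np.mean(gap ** 2))
    return float(np.mean(losses)) if losses else 0.0

def contrastive_prosody_loss(edited, original, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed form): pull the edited-region prosody
    embedding toward the original utterance, push it away from negatives."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    pos = cos(edited, original) / temperature
    negs = np.array([cos(edited, n) / temperature for n in negatives])
    logits = np.concatenate([[pos], negs])
    m = logits.max()  # subtract the max for a stable log-sum-exp
    log_sum = m + np.log(np.exp(logits - m).sum())
    return float(log_sum - pos)

# Usage: a jump in the features at the edit boundary yields a nonzero penalty.
mel = np.ones((40, 8))
mel[10:20] += 1.0
print(boundary_smoothness_loss(mel, 10, 20))
```

In this sketch the local term is computed only at the two boundaries, so it penalizes discontinuities without constraining the content inside the edited region, while the global term compares one embedding per region.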
Related papers
- DiffEditor: Enhancing Speech Editing with Semantic Enrichment and Acoustic Consistency [20.3466261946094]
We introduce DiffEditor, a novel speech editing model designed to enhance performance in OOD text scenarios.
We enrich the semantic information of phoneme embeddings by integrating word embeddings extracted from a pretrained language model.
We propose a first-order loss function to promote smoother transitions at editing boundaries and enhance the overall fluency of the edited speech.
arXiv Detail & Related papers (2024-09-19T07:11:54Z)
- FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency [44.7425844190807]
Text-based speech editing (TSE) techniques are designed to enable users to edit the output audio by modifying the input text transcript instead of the audio itself.
We propose a fluency speech editing model, termed FluentEditor, by considering a fluency-aware training criterion in TSE training.
The subjective and objective experimental results on VCTK demonstrate that our FluentEditor outperforms all advanced baselines in terms of naturalness and fluency.
arXiv Detail & Related papers (2023-09-21T01:58:01Z)
- Cross-Utterance Conditioned VAE for Speech Generation [27.5887600344053]
We present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation.
We propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing.
arXiv Detail & Related papers (2023-09-08T06:48:41Z)
- ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading [65.88161811719353]
This work develops a lightweight yet effective Text-to-Speech system, ContextSpeech.
We first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding.
We construct hierarchically-structured textual semantics to broaden the scope for global context enhancement.
Experiments show that ContextSpeech significantly improves the voice quality and prosody in paragraph reading with competitive model efficiency.
arXiv Detail & Related papers (2023-07-03T06:55:03Z)
- CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing [67.96138567288197]
This paper proposes a novel end-to-end text-based speech editing method called context-aware mask prediction network (CampNet).
The model can simulate the text-based speech editing process by randomly masking part of speech and then predicting the masked region by sensing the speech context.
It can solve unnatural prosody in the edited region and synthesize the speech corresponding to the unseen words in the transcript.
arXiv Detail & Related papers (2022-02-21T02:05:14Z)
- EdiTTS: Score-based Editing for Controllable Text-to-Speech [9.34612743192798]
EdiTTS is an off-the-shelf speech editing methodology based on score-based generative modeling for text-to-speech synthesis.
We apply coarse yet deliberate perturbations in the Gaussian prior space to induce desired behavior from the diffusion model.
Listening tests demonstrate that EdiTTS is capable of reliably generating natural-sounding audio that satisfies user-imposed requirements.
arXiv Detail & Related papers (2021-10-06T08:51:10Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Context-Aware Prosody Correction for Text-Based Speech Editing [28.459695630420832]
A major drawback of current systems is that edited recordings often sound unnatural because of prosody mismatches around edited regions.
We propose a new context-aware method for more natural sounding text-based editing of speech.
arXiv Detail & Related papers (2021-02-16T18:16:30Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.