FluentEditor: Text-based Speech Editing by Considering Acoustic and
Prosody Consistency
- URL: http://arxiv.org/abs/2309.11725v2
- Date: Fri, 22 Sep 2023 02:05:36 GMT
- Title: FluentEditor: Text-based Speech Editing by Considering Acoustic and
Prosody Consistency
- Authors: Rui Liu, Jiatian Xi, Ziyue Jiang and Haizhou Li
- Abstract summary: Text-based speech editing (TSE) techniques are designed to enable users to edit the output audio by modifying the input text transcript instead of the audio itself.
We propose a fluent speech editing model, termed FluentEditor, which introduces a fluency-aware training criterion into TSE training.
The subjective and objective experimental results on VCTK demonstrate that our FluentEditor outperforms all advanced baselines in terms of naturalness and fluency.
- Score: 44.7425844190807
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-based speech editing (TSE) techniques are designed to enable users to
edit the output audio by modifying the input text transcript instead of the
audio itself. Despite much progress in neural network-based TSE techniques, the
current techniques have focused on reducing the difference between the
generated speech segment and the reference target in the editing region,
ignoring its local and global fluency in the context and original utterance. To
maintain speech fluency, we propose a fluent speech editing model, termed
\textit{FluentEditor}, which introduces a fluency-aware training criterion into
TSE training. Specifically, the \textit{acoustic consistency constraint} aims
to keep the transition between the edited region and its neighboring acoustic
segments smooth and consistent with the ground truth, while the \textit{prosody
consistency constraint} seeks to ensure that the prosody attributes within the
edited regions remain consistent with the overall style of the original
utterance. The subjective and objective experimental results on VCTK
demonstrate that our \textit{FluentEditor} outperforms all advanced baselines
in terms of naturalness and fluency. The audio samples and code are available
at \url{https://github.com/Ai-S2-Lab/FluentEditor}.
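The abstract does not spell out the two constraints mathematically. As a minimal, hypothetical sketch (the function names, mel/F0 inputs, and boundary-window scheme below are assumptions for illustration, not the paper's actual losses), the two criteria could be approximated as:

```python
import numpy as np

def acoustic_consistency_loss(pred_mel, gt_mel, edit_start, edit_end, k=3):
    """Penalize mismatch in the k frames straddling each edit boundary,
    encouraging transitions into and out of the edited region that stay
    close to the ground-truth trajectory. Inputs are (frames, bins) mels."""
    n = pred_mel.shape[0]
    left = slice(max(edit_start - k, 0), min(edit_start + k, n))
    right = slice(max(edit_end - k, 0), min(edit_end + k, n))
    loss = 0.0
    for region in (left, right):
        loss += float(np.mean((pred_mel[region] - gt_mel[region]) ** 2))
    return loss

def prosody_consistency_loss(pred_feat, utt_feat):
    """Match the first and second moments of a prosody feature (e.g. log-F0)
    inside the edited region to the statistics of the whole utterance."""
    return float((pred_feat.mean() - utt_feat.mean()) ** 2
                 + (pred_feat.std() - utt_feat.std()) ** 2)
```

Both terms would be added to the usual reconstruction loss during training; the boundary window `k` controls how much context the smoothness term sees.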
Related papers
- FluentEditor+: Text-based Speech Editing by Modeling Local Hierarchical Acoustic Smoothness and Global Prosody Consistency [40.95700389032375]
Text-based speech editing (TSE) allows users to modify speech by editing the corresponding text and performing operations such as cutting, copying, and pasting.
Current TSE techniques focus on minimizing discrepancies between generated speech and reference targets within edited segments.
However, seamlessly integrating edited segments with unaltered portions of the audio remains challenging.
This paper introduces a novel approach, FluentEditor+, designed to overcome these limitations.
arXiv Detail & Related papers (2024-09-28T10:18:35Z)
- DiffEditor: Enhancing Speech Editing with Semantic Enrichment and Acoustic Consistency [20.3466261946094]
We introduce DiffEditor, a novel speech editing model designed to enhance performance in OOD text scenarios.
We enrich the semantic information of phoneme embeddings by integrating word embeddings extracted from a pretrained language model.
We propose a first-order loss function to promote smoother transitions at editing boundaries and enhance the overall fluency of the edited speech.
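A first-order loss of the kind DiffEditor describes can be sketched as follows; the window size, function name, and exact formulation are assumptions, not the paper's implementation:

```python
import numpy as np

def first_order_boundary_loss(pred, gt, boundary, k=2):
    """Compare frame-to-frame deltas (first-order differences) in a window
    around an edit boundary, so the local slope of the predicted feature
    trajectory matches the reference rather than jumping at the seam."""
    lo = max(boundary - k, 0)
    hi = min(boundary + k, pred.shape[0] - 1)
    d_pred = np.diff(pred[lo:hi + 1], axis=0)
    d_gt = np.diff(gt[lo:hi + 1], axis=0)
    return float(np.mean((d_pred - d_gt) ** 2))
```

Because first differences are invariant to a constant offset, such a loss penalizes mismatched slopes at the boundary rather than absolute levels, which is what "smoother transitions" requires.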
arXiv Detail & Related papers (2024-09-19T07:11:54Z)
- Voice Attribute Editing with Text Prompt [48.48628304530097]
This paper introduces a novel task: voice attribute editing with text prompt.
The goal is to make relative modifications to voice attributes according to the actions described in the text prompt.
To solve this task, VoxEditor, an end-to-end generative model, is proposed.
arXiv Detail & Related papers (2024-04-13T00:07:40Z)
- CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing [67.96138567288197]
This paper proposes a novel end-to-end text-based speech editing method called the context-aware mask prediction network (CampNet).
The model can simulate the text-based speech editing process by randomly masking part of speech and then predicting the masked region by sensing the speech context.
It can solve unnatural prosody in the edited region and synthesize the speech corresponding to the unseen words in the transcript.
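The masking-based training signal CampNet describes, hiding a span of speech and predicting it from the surrounding context, can be sketched as a data-preparation step (the masking fractions, names, and zero-fill choice here are illustrative assumptions):

```python
import numpy as np

def random_mask_region(mel, rng, min_frac=0.1, max_frac=0.3):
    """Zero out a random contiguous span of mel frames and return the
    masked mel plus a boolean mask; a model would be trained to
    reconstruct the masked span from the unmasked context."""
    n_frames = mel.shape[0]
    span = max(int(n_frames * rng.uniform(min_frac, max_frac)), 1)
    start = rng.integers(0, n_frames - span + 1)
    masked = mel.copy()
    masked[start:start + span] = 0.0
    mask = np.zeros(n_frames, dtype=bool)
    mask[start:start + span] = True
    return masked, mask
```

At inference, the mask is placed over the region the user's text edit affects, so the same context-infilling behavior performs the edit.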
arXiv Detail & Related papers (2022-02-21T02:05:14Z)
- EdiTTS: Score-based Editing for Controllable Text-to-Speech [9.34612743192798]
EdiTTS is an off-the-shelf speech editing methodology based on score-based generative modeling for text-to-speech synthesis.
We apply coarse yet deliberate perturbations in the Gaussian prior space to induce desired behavior from the diffusion model.
Listening tests demonstrate that EdiTTS is capable of reliably generating natural-sounding audio that satisfies user-imposed requirements.
arXiv Detail & Related papers (2021-10-06T08:51:10Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Context-Aware Prosody Correction for Text-Based Speech Editing [28.459695630420832]
A major drawback of current systems is that edited recordings often sound unnatural because of prosody mismatches around edited regions.
We propose a new context-aware method for more natural sounding text-based editing of speech.
arXiv Detail & Related papers (2021-02-16T18:16:30Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.