Context-Aware Prosody Correction for Text-Based Speech Editing
- URL: http://arxiv.org/abs/2102.08328v1
- Date: Tue, 16 Feb 2021 18:16:30 GMT
- Title: Context-Aware Prosody Correction for Text-Based Speech Editing
- Authors: Max Morrison, Lucas Rencker, Zeyu Jin, Nicholas J. Bryan, Juan-Pablo
Caceres, Bryan Pardo
- Abstract summary: A major drawback of current systems is that edited recordings often sound unnatural because of prosody mismatches around edited regions.
We propose a new context-aware method for more natural sounding text-based editing of speech.
- Score: 28.459695630420832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based speech editors expedite the process of editing speech recordings
by permitting editing via intuitive cut, copy, and paste operations on a speech
transcript. A major drawback of current systems, however, is that edited
recordings often sound unnatural because of prosody mismatches around edited
regions. In our work, we propose a new context-aware method for more natural
sounding text-based editing of speech. To do so, we 1) use a series of neural
networks to generate salient prosody features that are dependent on the prosody
of speech surrounding the edit and amenable to fine-grained user control 2) use
the generated features to control a standard pitch-shift and time-stretch
method and 3) apply a denoising neural network to remove artifacts induced by
the signal manipulation to yield a high-fidelity result. We evaluate our
approach using a subjective listening test, provide a detailed comparative
analysis, and conclude several interesting insights.
Related papers
- Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits [82.8859060022651]
We introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox.
Subjective evaluations confirm that speech edited using this novel technique is more challenging to detect than conventional cut-and-paste methods.
Despite human difficulty, experimental results demonstrate that self-supervised-based detectors can achieve remarkable performance in detection, localization, and generalization.
arXiv Detail & Related papers (2025-01-07T14:17:47Z) - FluentEditor2: Text-based Speech Editing by Modeling Multi-Scale Acoustic and Prosody Consistency [40.95700389032375]
Text-based speech editing (TSE) allows users to edit speech by modifying the corresponding text directly without altering the original recording.
Current TSE techniques often focus on minimizing discrepancies between generated speech and reference within edited regions during training to achieve fluent TSE performance.
We propose a new fluency speech editing scheme based on our previous textitFluentEditor model, termed textittextbfFluentEditor2.
arXiv Detail & Related papers (2024-09-28T10:18:35Z) - Speech Editing -- a Summary [8.713498822221222]
This paper explores text-based speech editing methods that modify audio via text transcripts without manual waveform editing.
The aim is to highlight ongoing issues and inspire further research and innovation in speech editing.
arXiv Detail & Related papers (2024-07-24T11:22:57Z) - FluentEditor: Text-based Speech Editing by Considering Acoustic and
Prosody Consistency [44.7425844190807]
Text-based speech editing (TSE) techniques are designed to enable users to edit the output audio by modifying the input text transcript instead of the audio itself.
We propose a fluency speech editing model, termed textitFluentEditor, by considering fluency-aware training criterion in the TSE training.
The subjective and objective experimental results on VCTK demonstrate that our textitFluentEditor outperforms all advanced baselines in terms of naturalness and fluency.
arXiv Detail & Related papers (2023-09-21T01:58:01Z) - Emotion Selectable End-to-End Text-based Speech Editing [63.346825713704625]
Emo-CampNet (emotion CampNet) is an emotion-selectable text-based speech editing model.
It can effectively control the emotion of the generated speech in the process of text-based speech editing.
It can also edit unseen speakers' speech.
arXiv Detail & Related papers (2022-12-20T12:02:40Z) - CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech
Editing [67.96138567288197]
This paper proposes a novel end-to-end text-based speech editing method called context-aware mask prediction network (CampNet)
The model can simulate the text-based speech editing process by randomly masking part of speech and then predicting the masked region by sensing the speech context.
It can solve unnatural prosody in the edited region and synthesize the speech corresponding to the unseen words in the transcript.
arXiv Detail & Related papers (2022-02-21T02:05:14Z) - Transcribing Natural Languages for The Deaf via Neural Editing Programs [84.0592111546958]
We study the task of glossification, of which the aim is to em transcribe natural spoken language sentences for the Deaf (hard-of-hearing) community to ordered sign language glosses.
Previous sequence-to-sequence language models often fail to capture the rich connections between the two distinct languages, leading to unsatisfactory transcriptions.
We observe that despite different grammars, glosses effectively simplify sentences for the ease of deaf communication, while sharing a large portion of vocabulary with sentences.
arXiv Detail & Related papers (2021-12-17T16:21:49Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.