Context-Aware Prosody Correction for Text-Based Speech Editing
- URL: http://arxiv.org/abs/2102.08328v1
- Date: Tue, 16 Feb 2021 18:16:30 GMT
- Title: Context-Aware Prosody Correction for Text-Based Speech Editing
- Authors: Max Morrison, Lucas Rencker, Zeyu Jin, Nicholas J. Bryan, Juan-Pablo
Caceres, Bryan Pardo
- Abstract summary: A major drawback of current systems is that edited recordings often sound unnatural because of prosody mismatches around edited regions.
We propose a new context-aware method for more natural sounding text-based editing of speech.
- Score: 28.459695630420832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based speech editors expedite the process of editing speech recordings
by permitting editing via intuitive cut, copy, and paste operations on a speech
transcript. A major drawback of current systems, however, is that edited
recordings often sound unnatural because of prosody mismatches around edited
regions. In our work, we propose a new context-aware method for more natural
sounding text-based editing of speech. To do so, we 1) use a series of neural
networks to generate salient prosody features that are dependent on the prosody
of speech surrounding the edit and amenable to fine-grained user control 2) use
the generated features to control a standard pitch-shift and time-stretch
method and 3) apply a denoising neural network to remove artifacts induced by
the signal manipulation to yield a high-fidelity result. We evaluate our
approach using a subjective listening test, provide a detailed comparative
analysis, and conclude several interesting insights.
Related papers
- FluentEditor+: Text-based Speech Editing by Modeling Local Hierarchical Acoustic Smoothness and Global Prosody Consistency [40.95700389032375]
Text-based speech editing (TSE) allows users to modify speech by editing the corresponding text and performing operations such as cutting, copying, and pasting.
Current TSE techniques focus on minimizing discrepancies between generated speech and reference targets within edited segments.
seamlessly integrating edited segments with unaltered portions of the audio remains challenging.
This paper introduces a novel approach, FluentEditor$tiny +$, designed to overcome these limitations.
arXiv Detail & Related papers (2024-09-28T10:18:35Z) - Speech Editing -- a Summary [8.713498822221222]
This paper explores text-based speech editing methods that modify audio via text transcripts without manual waveform editing.
The aim is to highlight ongoing issues and inspire further research and innovation in speech editing.
arXiv Detail & Related papers (2024-07-24T11:22:57Z) - Towards General-Purpose Text-Instruction-Guided Voice Conversion [84.78206348045428]
This paper introduces a novel voice conversion model, guided by text instructions such as "articulate slowly with a deep tone" or "speak in a cheerful boyish voice"
The proposed VC model is a neural language model which processes a sequence of discrete codes, resulting in the code sequence of converted speech.
arXiv Detail & Related papers (2023-09-25T17:52:09Z) - FluentEditor: Text-based Speech Editing by Considering Acoustic and
Prosody Consistency [44.7425844190807]
Text-based speech editing (TSE) techniques are designed to enable users to edit the output audio by modifying the input text transcript instead of the audio itself.
We propose a fluency speech editing model, termed textitFluentEditor, by considering fluency-aware training criterion in the TSE training.
The subjective and objective experimental results on VCTK demonstrate that our textitFluentEditor outperforms all advanced baselines in terms of naturalness and fluency.
arXiv Detail & Related papers (2023-09-21T01:58:01Z) - Emotion Selectable End-to-End Text-based Speech Editing [63.346825713704625]
Emo-CampNet (emotion CampNet) is an emotion-selectable text-based speech editing model.
It can effectively control the emotion of the generated speech in the process of text-based speech editing.
It can also edit unseen speakers' speech.
arXiv Detail & Related papers (2022-12-20T12:02:40Z) - CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech
Editing [67.96138567288197]
This paper proposes a novel end-to-end text-based speech editing method called context-aware mask prediction network (CampNet)
The model can simulate the text-based speech editing process by randomly masking part of speech and then predicting the masked region by sensing the speech context.
It can solve unnatural prosody in the edited region and synthesize the speech corresponding to the unseen words in the transcript.
arXiv Detail & Related papers (2022-02-21T02:05:14Z) - Transcribing Natural Languages for The Deaf via Neural Editing Programs [84.0592111546958]
We study the task of glossification, of which the aim is to em transcribe natural spoken language sentences for the Deaf (hard-of-hearing) community to ordered sign language glosses.
Previous sequence-to-sequence language models often fail to capture the rich connections between the two distinct languages, leading to unsatisfactory transcriptions.
We observe that despite different grammars, glosses effectively simplify sentences for the ease of deaf communication, while sharing a large portion of vocabulary with sentences.
arXiv Detail & Related papers (2021-12-17T16:21:49Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.