Emotion Selectable End-to-End Text-based Speech Editing
- URL: http://arxiv.org/abs/2212.10191v1
- Date: Tue, 20 Dec 2022 12:02:40 GMT
- Title: Emotion Selectable End-to-End Text-based Speech Editing
- Authors: Tao Wang, Jiangyan Yi, Ruibo Fu, Jianhua Tao, Zhengqi Wen, Chu Yuan Zhang
- Abstract summary: Emo-CampNet (emotion CampNet) is an emotion-selectable text-based speech editing model.
It can effectively control the emotion of the generated speech in the process of text-based speech editing.
It can also edit unseen speakers' speech.
- Score: 63.346825713704625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based speech editing allows users to edit speech by intuitively cutting,
copying, and pasting text to speed up the process of editing speech. In the
previous work, CampNet (context-aware mask prediction network) is proposed to
realize text-based speech editing, significantly improving the quality of
edited speech. This paper aims at a new task: adding an emotional effect to the
edited speech during text-based speech editing to make the generated
speech more expressive. To achieve this task, we propose Emo-CampNet (emotion
CampNet), which can provide the option of emotional attributes for the
generated speech in text-based speech editing and has the one-shot ability to
edit unseen speakers' speech. Firstly, we propose an end-to-end
emotion-selectable text-based speech editing model. The key idea of the model
is to control the emotion of generated speech by introducing additional emotion
attributes based on the context-aware mask prediction network. Secondly, to
prevent the emotion of the generated speech from being disturbed by the
emotional components in the original speech, a neutral content generator is
proposed to remove the emotion from the original speech, which is optimized by
the generative adversarial framework. Thirdly, two data augmentation methods
are proposed to enrich the emotional and pronunciation information in the
training set, which can enable the model to edit the unseen speaker's speech.
The experimental results show that: 1) Emo-CampNet can effectively control the
emotion of the generated speech in the process of text-based speech editing
and can edit unseen speakers' speech; 2) detailed ablation experiments further
prove the effectiveness of the emotion selectivity and data augmentation methods.
The demo page is available at https://hairuo55.github.io/Emo-CampNet/
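As a rough illustration of the key idea above (not the authors' implementation), context-aware mask prediction can be sketched in plain Python: frames covering the edited text span are masked out, and a predictor regenerates them from the surrounding context together with a target emotion label. All names here (`edit_speech`, `toy_predictor`) and the mean-of-context "model" are hypothetical stand-ins for the actual neural network.

```python
# Toy sketch of mask-and-predict speech editing with an emotion attribute.
# A "spectrogram" is modeled as a list of frames (lists of floats).

def edit_speech(frames, edit_start, edit_end, emotion_id, predictor):
    """Mask the frames in [edit_start, edit_end) and let `predictor`
    regenerate them from the unmasked context plus the emotion label."""
    masked = [None if edit_start <= i < edit_end else f
              for i, f in enumerate(frames)]
    # The predictor sees the whole partially masked sequence and the
    # desired emotion, mirroring the conditioning described in the abstract.
    filled = predictor(masked, emotion_id)
    return [filled[i] if masked[i] is None else frames[i]
            for i in range(len(frames))]

def toy_predictor(masked, emotion_id):
    """Stand-in model: fills masked frames with the mean of the context
    frames, shifted by the emotion id so that different requested emotions
    produce different output."""
    context = [f for f in masked if f is not None]
    dim = len(context[0])
    mean = [sum(f[d] for f in context) / len(context) for d in range(dim)]
    fill = [v + emotion_id for v in mean]
    return [fill if f is None else f for f in masked]
```

In the actual model the predictor is a trained network and the emotion id would be an embedded attribute vector; the sketch only shows how editing reduces to conditional infilling of a masked region.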
Related papers
- EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech [0.0]
State-of-the-art speech models try to get as close as possible to the human voice.
Modelling emotions is an essential part of Text-To-Speech (TTS) research.
EmoSpeech surpasses existing models regarding both MOS score and emotion recognition accuracy in generated speech.
arXiv Detail & Related papers (2023-06-28T19:34:16Z)
- ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z)
- Learning to Dub Movies via Hierarchical Prosody Models [167.6465354313349]
Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker's voice as reference.
We propose a novel movie dubbing architecture to tackle these problems via hierarchical prosody modelling, which bridges the visual information to corresponding speech prosody from three aspects: lip, face, and scene.
arXiv Detail & Related papers (2022-12-08T03:29:04Z)
- CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing [67.96138567288197]
This paper proposes a novel end-to-end text-based speech editing method called context-aware mask prediction network (CampNet).
The model can simulate the text-based speech editing process by randomly masking part of speech and then predicting the masked region by sensing the speech context.
It can solve unnatural prosody in the edited region and synthesize the speech corresponding to the unseen words in the transcript.
arXiv Detail & Related papers (2022-02-21T02:05:14Z)
- Emotional Prosody Control for Speech Generation [7.66200737962746]
We propose a text-to-speech (TTS) system where a user can choose the emotion of the generated speech from a continuous and meaningful emotion space.
The proposed TTS system can generate speech from the text in any speaker's style, with fine control of emotion.
arXiv Detail & Related papers (2021-11-07T08:52:04Z)
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
- Context-Aware Prosody Correction for Text-Based Speech Editing [28.459695630420832]
A major drawback of current systems is that edited recordings often sound unnatural because of prosody mismatches around edited regions.
We propose a new context-aware method for more natural sounding text-based editing of speech.
arXiv Detail & Related papers (2021-02-16T18:16:30Z)
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on the variational auto-encoding Wasserstein generative adversarial network (VAW-GAN).
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information it provides and is not responsible for any consequences of its use.