CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing
- URL: http://arxiv.org/abs/2202.09950v1
- Date: Mon, 21 Feb 2022 02:05:14 GMT
- Title: CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing
- Authors: Tao Wang, Jiangyan Yi, Ruibo Fu, Jianhua Tao, Zhengqi Wen
- Abstract summary: This paper proposes a novel end-to-end text-based speech editing method called the context-aware mask prediction network (CampNet).
The model can simulate the text-based speech editing process by randomly masking part of speech and then predicting the masked region by sensing the speech context.
It can solve unnatural prosody in the edited region and synthesize the speech corresponding to the unseen words in the transcript.
- Score: 67.96138567288197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The text-based speech editor allows the editing of speech through intuitive
cutting, copying, and pasting operations to speed up the process of editing
speech. However, the major drawback of current systems is that the edited speech
often sounds unnatural because of the cut-copy-paste operations. In addition, it is
not obvious how to synthesize a recording for a new word that does not appear in
the transcript. This paper proposes a novel end-to-end text-based speech editing
method called context-aware mask prediction network (CampNet). The model can
simulate the text-based speech editing process by randomly masking part of
speech and then predicting the masked region by sensing the speech context. It
can resolve the unnatural prosody in the edited region and synthesize the speech
corresponding to unseen words in the transcript. Secondly, to cover the possible
operations in text-based speech editing, we design three operations based on
CampNet: deletion, insertion, and replacement, which together cover the common
situations in speech editing. Thirdly, to synthesize the speech
corresponding to long text in insertion and replacement operations, a
word-level autoregressive generation method is proposed. Fourthly, we propose a
speaker adaptation method for CampNet that uses only one sentence and explore its
few-shot learning ability, which provides a new idea for speech forgery tasks.
Subjective and objective experiments on the VCTK and
LibriTTS datasets show that the speech editing results based on CampNet are
better than those of TTS technology, manual editing, and the VoCo method. We also conduct
detailed ablation experiments to explore the effect of the CampNet structure on
its performance. Finally, experiments show that speaker adaptation with
only one sentence can further improve the naturalness of the edited speech. Examples of
generated speech can be found at https://hairuo55.github.io/CampNet.
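To make the masking idea above concrete, here is a minimal, hypothetical PyTorch sketch of context-aware mask prediction: a random span of mel-spectrogram frames is masked, and the model is trained to reconstruct only that span from the surrounding unmasked frames and the full transcript. The names (`CampNetSketch`, `mask_random_span`), the generic Transformer encoder-decoder, and all sizes are illustrative assumptions; positional encodings, the vocoder, and the other details of the paper's actual architecture are omitted.

```python
# Minimal sketch of context-aware mask prediction (illustrative only, not the authors' code).
import torch
import torch.nn as nn


def mask_random_span(mel, max_ratio=0.5):
    """Zero out one random contiguous span of mel frames; return the masked mel and the span mask."""
    num_frames = mel.size(1)
    span_len = torch.randint(1, max(2, int(num_frames * max_ratio)), (1,)).item()
    start = torch.randint(0, num_frames - span_len + 1, (1,)).item()
    mask = torch.zeros(num_frames, dtype=torch.bool)
    mask[start:start + span_len] = True
    masked_mel = mel.clone()
    masked_mel[:, mask, :] = 0.0
    return masked_mel, mask


class CampNetSketch(nn.Module):
    """Text encoder plus spectrogram decoder that reconstructs masked frames from context."""

    def __init__(self, n_phones=100, n_mels=80, d_model=256):
        super().__init__()
        self.text_emb = nn.Embedding(n_phones, d_model)
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=3)
        self.mel_proj = nn.Linear(n_mels, d_model)
        self.dec = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=3)
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, phone_ids, masked_mel):
        text_ctx = self.text_enc(self.text_emb(phone_ids))      # (B, T_text, D) transcript context
        hidden = self.dec(self.mel_proj(masked_mel), text_ctx)  # mel frames cross-attend to the text
        return self.out(hidden)                                  # predicted mel frames, (B, T_mel, n_mels)


# One training step: the loss is computed only on the masked span, so the model must
# infer the missing content and prosody from the surrounding speech and the transcript.
model = CampNetSketch()
phones = torch.randint(0, 100, (1, 30))   # toy phoneme ids for the transcript
mel = torch.randn(1, 120, 80)             # toy ground-truth mel-spectrogram
masked_mel, mask = mask_random_span(mel)
pred = model(phones, masked_mel)
loss = nn.functional.l1_loss(pred[:, mask, :], mel[:, mask, :])
loss.backward()
```

At editing time the same mechanism is aimed at a user-chosen region: the frames to be deleted or replaced are treated as the masked span and the edited transcript drives the prediction, which is how the deletion, insertion, and replacement operations described in the abstract can reuse a single model. This remains a sketch of the idea under the stated assumptions, not the authors' implementation.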
Related papers
- FluentEditor+: Text-based Speech Editing by Modeling Local Hierarchical Acoustic Smoothness and Global Prosody Consistency [40.95700389032375]
Text-based speech editing (TSE) allows users to modify speech by editing the corresponding text and performing operations such as cutting, copying, and pasting.
Current TSE techniques focus on minimizing discrepancies between generated speech and reference targets within edited segments.
However, seamlessly integrating edited segments with the unaltered portions of the audio remains challenging.
This paper introduces a novel approach, FluentEditor+, designed to overcome these limitations.
arXiv Detail & Related papers (2024-09-28T10:18:35Z) - Speech Editing -- a Summary [8.713498822221222]
This paper explores text-based speech editing methods that modify audio via text transcripts without manual waveform editing.
The aim is to highlight ongoing issues and inspire further research and innovation in speech editing.
arXiv Detail & Related papers (2024-07-24T11:22:57Z) - Emotion Selectable End-to-End Text-based Speech Editing [63.346825713704625]
Emo-CampNet (emotion CampNet) is an emotion-selectable text-based speech editing model.
It can effectively control the emotion of the generated speech in the process of text-based speech editing.
It can also edit unseen speakers' speech.
arXiv Detail & Related papers (2022-12-20T12:02:40Z) - SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z) - SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z) - Context-Aware Prosody Correction for Text-Based Speech Editing [28.459695630420832]
A major drawback of current systems is that edited recordings often sound unnatural because of prosody mismatches around edited regions.
We propose a new context-aware method for more natural sounding text-based editing of speech.
arXiv Detail & Related papers (2021-02-16T18:16:30Z) - Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)