Related papers: Context-Aware Prosody Correction for Text-Based Speech Editing

Context-Aware Prosody Correction for Text-Based Speech Editing

URL: http://arxiv.org/abs/2102.08328v1
Date: Tue, 16 Feb 2021 18:16:30 GMT
Title: Context-Aware Prosody Correction for Text-Based Speech Editing
Authors: Max Morrison, Lucas Rencker, Zeyu Jin, Nicholas J. Bryan, Juan-Pablo Caceres, Bryan Pardo
Abstract summary: A major drawback of current systems is that edited recordings often sound unnatural because of prosody mismatches around edited regions. We propose a new context-aware method for more natural sounding text-based editing of speech.
Score: 28.459695630420832
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-based speech editors expedite the process of editing speech recordings by permitting editing via intuitive cut, copy, and paste operations on a speech transcript. A major drawback of current systems, however, is that edited recordings often sound unnatural because of prosody mismatches around edited regions. In our work, we propose a new context-aware method for more natural sounding text-based editing of speech. To do so, we 1) use a series of neural networks to generate salient prosody features that are dependent on the prosody of speech surrounding the edit and amenable to fine-grained user control 2) use the generated features to control a standard pitch-shift and time-stretch method and 3) apply a denoising neural network to remove artifacts induced by the signal manipulation to yield a high-fidelity result. We evaluate our approach using a subjective listening test, provide a detailed comparative analysis, and conclude several interesting insights.

Related papers

Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits [82.8859060022651]
We introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox. Subjective evaluations confirm that speech edited using this novel technique is more challenging to detect than conventional cut-and-paste methods. Despite human difficulty, experimental results demonstrate that self-supervised-based detectors can achieve remarkable performance in detection, localization, and generalization.
arXiv Detail & Related papers (2025-01-07T14:17:47Z)
FluentEditor+: Text-based Speech Editing by Modeling Local Hierarchical Acoustic Smoothness and Global Prosody Consistency [40.95700389032375]
Text-based speech editing (TSE) allows users to modify speech by editing the corresponding text and performing operations such as cutting, copying, and pasting. Current TSE techniques focus on minimizing discrepancies between generated speech and reference targets within edited segments. seamlessly integrating edited segments with unaltered portions of the audio remains challenging. This paper introduces a novel approach, FluentEditor$tiny +$, designed to overcome these limitations.
arXiv Detail & Related papers (2024-09-28T10:18:35Z)
Speech Editing -- a Summary [8.713498822221222]
This paper explores text-based speech editing methods that modify audio via text transcripts without manual waveform editing. The aim is to highlight ongoing issues and inspire further research and innovation in speech editing.
arXiv Detail & Related papers (2024-07-24T11:22:57Z)
Towards General-Purpose Text-Instruction-Guided Voice Conversion [84.78206348045428]
This paper introduces a novel voice conversion model, guided by text instructions such as "articulate slowly with a deep tone" or "speak in a cheerful boyish voice" The proposed VC model is a neural language model which processes a sequence of discrete codes, resulting in the code sequence of converted speech.
arXiv Detail & Related papers (2023-09-25T17:52:09Z)
FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency [44.7425844190807]
Text-based speech editing (TSE) techniques are designed to enable users to edit the output audio by modifying the input text transcript instead of the audio itself. We propose a fluency speech editing model, termed textitFluentEditor, by considering fluency-aware training criterion in the TSE training. The subjective and objective experimental results on VCTK demonstrate that our textitFluentEditor outperforms all advanced baselines in terms of naturalness and fluency.
arXiv Detail & Related papers (2023-09-21T01:58:01Z)
Emotion Selectable End-to-End Text-based Speech Editing [63.346825713704625]
Emo-CampNet (emotion CampNet) is an emotion-selectable text-based speech editing model. It can effectively control the emotion of the generated speech in the process of text-based speech editing. It can also edit unseen speakers' speech.
arXiv Detail & Related papers (2022-12-20T12:02:40Z)
CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing [67.96138567288197]
This paper proposes a novel end-to-end text-based speech editing method called context-aware mask prediction network (CampNet) The model can simulate the text-based speech editing process by randomly masking part of speech and then predicting the masked region by sensing the speech context. It can solve unnatural prosody in the edited region and synthesize the speech corresponding to the unseen words in the transcript.
arXiv Detail & Related papers (2022-02-21T02:05:14Z)
Transcribing Natural Languages for The Deaf via Neural Editing Programs [84.0592111546958]
We study the task of glossification, of which the aim is to em transcribe natural spoken language sentences for the Deaf (hard-of-hearing) community to ordered sign language glosses. Previous sequence-to-sequence language models often fail to capture the rich connections between the two distinct languages, leading to unsatisfactory transcriptions. We observe that despite different grammars, glosses effectively simplify sentences for the ease of deaf communication, while sharing a large portion of vocabulary with sentences.
arXiv Detail & Related papers (2021-12-17T16:21:49Z)
Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker. We generate the mel-spectrogram of the edited speech with a transformer-based decoder. It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.