Related papers: DiffEditor: Enhancing Speech Editing with Semantic Enrichment and Acoustic Consistency

DiffEditor: Enhancing Speech Editing with Semantic Enrichment and Acoustic Consistency

URL: http://arxiv.org/abs/2409.12992v1
Date: Thu, 19 Sep 2024 07:11:54 GMT
Title: DiffEditor: Enhancing Speech Editing with Semantic Enrichment and Acoustic Consistency
Authors: Yang Chen, Yuhang Jia, Shiwan Zhao, Ziyue Jiang, Haoran Li, Jiarong Kang, Yong Qin,
Abstract summary: We introduce DiffEditor, a novel speech editing model designed to enhance performance in OOD text scenarios. We enrich the semantic information of phoneme embeddings by integrating word embeddings extracted from a pretrained language model. We propose a first-order loss function to promote smoother transitions at editing boundaries and enhance the overall fluency of the edited speech.
Score: 20.3466261946094
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As text-based speech editing becomes increasingly prevalent, the demand for unrestricted free-text editing continues to grow. However, existing speech editing techniques encounter significant challenges, particularly in maintaining intelligibility and acoustic consistency when dealing with out-of-domain (OOD) text. In this paper, we introduce, DiffEditor, a novel speech editing model designed to enhance performance in OOD text scenarios through semantic enrichment and acoustic consistency. To improve the intelligibility of the edited speech, we enrich the semantic information of phoneme embeddings by integrating word embeddings extracted from a pretrained language model. Furthermore, we emphasize that interframe smoothing properties are critical for modeling acoustic consistency, and thus we propose a first-order loss function to promote smoother transitions at editing boundaries and enhance the overall fluency of the edited speech. Experimental results demonstrate that our model achieves state-of-the-art performance in both in-domain and OOD text scenarios.

Related papers

FluentEditor+: Text-based Speech Editing by Modeling Local Hierarchical Acoustic Smoothness and Global Prosody Consistency [40.95700389032375]
Text-based speech editing (TSE) allows users to modify speech by editing the corresponding text and performing operations such as cutting, copying, and pasting. Current TSE techniques focus on minimizing discrepancies between generated speech and reference targets within edited segments. seamlessly integrating edited segments with unaltered portions of the audio remains challenging. This paper introduces a novel approach, FluentEditor$tiny +$, designed to overcome these limitations.
arXiv Detail & Related papers (2024-09-28T10:18:35Z)
uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models [57.71199494492223]
We propose a Unified Speech Enhancement and Editing (uSee) model with conditional diffusion models to handle various tasks at the same time in a generative manner. Our experiments show that our proposed uSee model can achieve superior performance in both speech denoising and dereverberation compared to other related generative speech enhancement models.
arXiv Detail & Related papers (2023-10-02T04:36:39Z)
FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency [44.7425844190807]
Text-based speech editing (TSE) techniques are designed to enable users to edit the output audio by modifying the input text transcript instead of the audio itself. We propose a fluency speech editing model, termed textitFluentEditor, by considering fluency-aware training criterion in the TSE training. The subjective and objective experimental results on VCTK demonstrate that our textitFluentEditor outperforms all advanced baselines in terms of naturalness and fluency.
arXiv Detail & Related papers (2023-09-21T01:58:01Z)
EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis. This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units [69.06657692891447]
We propose a novel AVO method leveraging the learning objective of self-supervised discrete speech unit prediction. Experimental results show that our proposed method achieves remarkable lip-speech synchronization and high speech quality.
arXiv Detail & Related papers (2023-06-29T15:02:22Z)
Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders [14.723225542605105]
Text-based voice editing (TBVE) uses synthetic output from text-to-speech (TTS) systems to replace words in an original recording. Recent work has used neural models to produce edited speech similar to the original speech in terms of clarity, speaker identity, and prosody. This work focuses on the zero-shot approach which avoids finetuning altogether.
arXiv Detail & Related papers (2022-10-28T10:31:44Z)
A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing [31.666920933058144]
We propose our framework, Alignment-Aware Acoustic-Text Pretraining (A$3$T), which reconstructs masked acoustic signals with text input and acoustic-text alignment during training. Experiments show A$3$T outperforms SOTA models on speech editing, and improves multi-speaker speech synthesis without the external speaker verification model.
arXiv Detail & Related papers (2022-03-18T01:36:25Z)
Transcribing Natural Languages for The Deaf via Neural Editing Programs [84.0592111546958]
We study the task of glossification, of which the aim is to em transcribe natural spoken language sentences for the Deaf (hard-of-hearing) community to ordered sign language glosses. Previous sequence-to-sequence language models often fail to capture the rich connections between the two distinct languages, leading to unsatisfactory transcriptions. We observe that despite different grammars, glosses effectively simplify sentences for the ease of deaf communication, while sharing a large portion of vocabulary with sentences.
arXiv Detail & Related papers (2021-12-17T16:21:49Z)
Context-Aware Prosody Correction for Text-Based Speech Editing [28.459695630420832]
A major drawback of current systems is that edited recordings often sound unnatural because of prosody mismatches around edited regions. We propose a new context-aware method for more natural sounding text-based editing of speech.
arXiv Detail & Related papers (2021-02-16T18:16:30Z)
Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language via an end-to-end way. Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously. We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.