Speech Editing -- a Summary
- URL: http://arxiv.org/abs/2407.17172v1
- Date: Wed, 24 Jul 2024 11:22:57 GMT
- Title: Speech Editing -- a Summary
- Authors: Tobias Kässmann, Yining Liu, Danni Liu
- Abstract summary: This paper explores text-based speech editing methods that modify audio via text transcripts without manual waveform editing.
The aim is to highlight ongoing issues and inspire further research and innovation in speech editing.
- Score: 8.713498822221222
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rise of video production and social media, speech editing has become crucial for creators to address issues like mispronunciations, missing words, or stuttering in audio recordings. This paper explores text-based speech editing methods that modify audio via text transcripts without manual waveform editing. These approaches ensure edited audio is indistinguishable from the original by altering the mel-spectrogram. Recent advancements, such as context-aware prosody correction and advanced attention mechanisms, have improved speech editing quality. This paper reviews state-of-the-art methods, compares key metrics, and examines widely used datasets. The aim is to highlight ongoing issues and inspire further research and innovation in speech editing.
Related papers
- Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation [56.92841782969847]
We introduce a novel task called language-guided joint audio-visual editing.
Given an audio and image pair of a sounding event, this task aims at generating new audio-visual content by editing the given sounding event conditioned on the language guidance.
We propose a new diffusion-based framework for joint audio-visual editing and introduce two key ideas.
arXiv Detail & Related papers (2024-10-09T22:02:30Z)
- FluentEditor+: Text-based Speech Editing by Modeling Local Hierarchical Acoustic Smoothness and Global Prosody Consistency [40.95700389032375]
Text-based speech editing (TSE) allows users to modify speech by editing the corresponding text and performing operations such as cutting, copying, and pasting.
Current TSE techniques focus on minimizing discrepancies between generated speech and reference targets within edited segments, but seamlessly integrating edited segments with the unaltered portions of the audio remains challenging.
This paper introduces a novel approach, FluentEditor+, designed to overcome these limitations.
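The cut, copy, and paste operations that TSE exposes can be illustrated with word-level alignments: editing the transcript maps to splicing the waveform at word boundaries. The sketch below is a toy illustration only (dummy signal, hypothetical `delete_word` helper and alignment table); real TSE systems regenerate the edited region rather than hard-splicing it.

```python
import numpy as np

def delete_word(audio, alignments, word):
    """Remove the aligned span of `word` from the waveform by splicing.

    `alignments` maps each word to its (start, end) sample indices;
    both the name and the table are hypothetical for this sketch.
    """
    start, end = alignments[word]
    return np.concatenate([audio[:start], audio[end:]])

sr = 16000
audio = np.arange(3 * sr, dtype=np.float32)  # 3-second dummy signal
alignments = {
    "hello": (0, sr),
    "there": (sr, 2 * sr),
    "world": (2 * sr, 3 * sr),
}

edited = delete_word(audio, alignments, "there")
print(len(edited) / sr)  # 2.0 -> one second removed
```

A hard splice like this is exactly what leaves the audible discontinuities at segment boundaries that FluentEditor+ targets with its smoothness and prosody-consistency objectives.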
arXiv Detail & Related papers (2024-09-28T10:18:35Z)
- Audio Editing with Non-Rigid Text Prompts [24.008609489049206]
We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio.
We explore text prompts that perform addition, style transfer, and in-painting.
arXiv Detail & Related papers (2023-10-19T16:09:44Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Emotion Selectable End-to-End Text-based Speech Editing [63.346825713704625]
Emo-CampNet (emotion CampNet) is an emotion-selectable text-based speech editing model.
It can effectively control the emotion of the generated speech in the process of text-based speech editing.
It can also edit unseen speakers' speech.
arXiv Detail & Related papers (2022-12-20T12:02:40Z)
- CorrectSpeech: A Fully Automated System for Speech Correction and Accent Reduction [37.52612296258531]
The proposed system, named CorrectSpeech, performs the correction in three steps.
The quality and naturalness of corrected speech depend on the performance of speech recognition and alignment modules.
The results demonstrate that our system is able to correct mispronunciation and reduce accent in speech recordings.
arXiv Detail & Related papers (2022-04-12T01:20:29Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing [67.96138567288197]
This paper proposes a novel end-to-end text-based speech editing method called the context-aware mask prediction network (CampNet).
The model can simulate the text-based speech editing process by randomly masking part of speech and then predicting the masked region by sensing the speech context.
It can solve unnatural prosody in the edited region and synthesize the speech corresponding to the unseen words in the transcript.
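The mask-and-predict scheme described above can be sketched on a mel-spectrogram: hide a random span of frames, then fill it from the surrounding context. This is a minimal toy illustration of the data shapes only; where CampNet predicts the masked region with a neural network, the sketch below substitutes plain linear interpolation between the boundary frames, and all names (`mask_and_infill`, the 80x200 spectrogram) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_and_infill(mel, span_len=20):
    """Mask a random span of frames and fill it from context.

    Stand-in for learned mask prediction: here the "prediction" is
    linear interpolation between the frames bordering the masked span.
    """
    n_mels, n_frames = mel.shape
    start = int(rng.integers(1, n_frames - span_len - 1))
    end = start + span_len
    out = mel.copy()
    out[:, start:end] = 0.0                        # masked editing region
    left, right = mel[:, start - 1], mel[:, end]   # context boundary frames
    w = np.linspace(0.0, 1.0, span_len + 2)[1:-1]  # interior weights
    out[:, start:end] = (1 - w) * left[:, None] + w * right[:, None]
    return out, (start, end)

mel = rng.standard_normal((80, 200)).astype(np.float32)  # dummy mel-spectrogram
filled, (s, e) = mask_and_infill(mel)
print(filled.shape)  # (80, 200)
```

The key property, shared with the learned version, is that everything outside the masked span is passed through untouched, so only the edited region is regenerated.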
arXiv Detail & Related papers (2022-02-21T02:05:14Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Context-Aware Prosody Correction for Text-Based Speech Editing [28.459695630420832]
A major drawback of current systems is that edited recordings often sound unnatural because of prosody mismatches around edited regions.
We propose a new context-aware method for more natural sounding text-based editing of speech.
arXiv Detail & Related papers (2021-02-16T18:16:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.