CorrectSpeech: A Fully Automated System for Speech Correction and Accent
Reduction
- URL: http://arxiv.org/abs/2204.05460v1
- Date: Tue, 12 Apr 2022 01:20:29 GMT
- Authors: Daxin Tan, Liqun Deng, Nianzu Zheng, Yu Ting Yeung, Xin Jiang, Xiao
Chen, Tan Lee
- Abstract summary: The proposed system, named CorrectSpeech, performs the correction in three steps.
The quality and naturalness of corrected speech depend on the performance of speech recognition and alignment modules.
The results demonstrate that our system is able to correct mispronunciation and reduce accent in speech recordings.
- Score: 37.52612296258531
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study extends our previous work on text-based speech editing to
developing a fully automated system for speech correction and accent reduction.
Consider an application scenario in which a recorded speech audio contains certain
errors, e.g., inappropriate words or mispronunciations, that need to be
corrected. The proposed system, named CorrectSpeech, performs the correction in
three steps: recognizing the recorded speech and converting it into
time-stamped symbol sequence, aligning recognized symbol sequence with target
text to determine locations and types of required edit operations, and
generating the corrected speech. Experiments show that the quality and
naturalness of corrected speech depend on the performance of speech recognition
and alignment modules, as well as the granularity level of editing operations.
The proposed system is evaluated on two corpora: a manually perturbed version
of VCTK and L2-ARCTIC. The results demonstrate that our system is able to
correct mispronunciation and reduce accent in speech recordings. Audio samples
are available online for demonstration at
https://daxintan-cuhk.github.io/CorrectSpeech/ .
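The second of the three steps above, aligning the recognized symbol sequence with the target text to determine the locations and types of edit operations, can be illustrated with a minimal sketch. This uses Python's difflib rather than the paper's actual alignment module, and the phoneme sequences are made-up examples, not data from the paper:

```python
# Sketch of the alignment step: derive edit operations between a recognized
# symbol sequence and the symbols of the target text. difflib stands in for
# the system's aligner; the phoneme strings are hypothetical.
from difflib import SequenceMatcher

recognized = ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]  # assumed ASR output
target     = ["HH", "EH", "L", "OW", "W", "ER", "L", "D"]  # symbols of target text

ops = SequenceMatcher(a=recognized, b=target, autojunk=False).get_opcodes()
# Keep only the regions that require editing; "equal" spans are left untouched.
edits = [(tag, recognized[i1:i2], target[j1:j2])
         for tag, i1, i2, j1, j2 in ops if tag != "equal"]
print(edits)  # -> [('replace', ['AH'], ['EH'])]
```

Each non-equal opcode ("replace", "delete", "insert") marks a region whose time stamps would then be passed to the generation step; the granularity of these operations (phone vs. word level) is what the abstract reports as affecting output quality.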
Related papers
- Speaker Tagging Correction With Non-Autoregressive Language Models [0.0]
We propose a speaker tagging correction system based on a non-autoregressive language model.
We show that the employed error correction approach leads to reductions in word diarization error rate (WDER) on two datasets.
arXiv Detail & Related papers (2024-08-30T11:02:17Z)
- Speech Editing -- a Summary [8.713498822221222]
This paper explores text-based speech editing methods that modify audio via text transcripts without manual waveform editing.
The aim is to highlight ongoing issues and inspire further research and innovation in speech editing.
arXiv Detail & Related papers (2024-07-24T11:22:57Z)
- DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction [50.51901599433536]
DisfluencyFixer is a tool that performs speech-to-speech disfluency correction in English and Hindi.
Our proposed system removes disfluencies from input speech and returns fluent speech as output.
arXiv Detail & Related papers (2023-05-26T14:13:38Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- Correcting Misproducted Speech using Spectrogram Inpainting [15.565673574838934]
This paper proposes a method to synthetically generate correct pronunciation feedback given incorrect production.
The system prompts the user to pronounce a phrase. The speech is recorded, and the samples associated with the inaccurate phoneme are masked with zeros.
Results suggest that human listeners slightly prefer our generated speech over a smoothed replacement of the inaccurate phoneme with a production of a different speaker.
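The masking described above, zeroing out the samples associated with the inaccurate phoneme before inpainting, can be sketched as follows. The sample rate, signal, and phoneme boundaries are made-up values for illustration, not the paper's data:

```python
import numpy as np

# Hypothetical setup: 1 second of audio at an assumed 16 kHz sample rate.
sr = 16000
speech = np.random.default_rng(0).standard_normal(sr)

# Assumed time span of the inaccurately produced phoneme, in seconds.
start_s, end_s = 0.30, 0.45

# Mask the phoneme's samples with zeros; an inpainting model would then
# fill in this region conditioned on the surrounding context.
masked = speech.copy()
masked[int(start_s * sr):int(end_s * sr)] = 0.0
```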
arXiv Detail & Related papers (2022-04-07T11:58:29Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data [115.38309338462588]
We develop AdaSpeech 2, an adaptive TTS system that only leverages untranscribed speech data for adaptation.
Specifically, we introduce a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction.
In adaptation, we use untranscribed speech data for speech reconstruction and only fine-tune the TTS decoder.
arXiv Detail & Related papers (2021-04-20T01:53:30Z)
- Context-Aware Prosody Correction for Text-Based Speech Editing [28.459695630420832]
A major drawback of current systems is that edited recordings often sound unnatural because of prosody mismatches around edited regions.
We propose a new context-aware method for more natural sounding text-based editing of speech.
arXiv Detail & Related papers (2021-02-16T18:16:30Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.