Correcting Misproduced Speech using Spectrogram Inpainting
- URL: http://arxiv.org/abs/2204.03379v1
- Date: Thu, 7 Apr 2022 11:58:29 GMT
- Title: Correcting Misproduced Speech using Spectrogram Inpainting
- Authors: Talia Ben-Simon, Felix Kreuk, Faten Awwad, Jacob T. Cohen, Joseph
Keshet
- Abstract summary: This paper proposes a method to synthetically generate correct pronunciation feedback given an incorrect production.
The system prompts the user to pronounce a phrase. The speech is recorded, and the samples associated with the inaccurate phoneme are masked with zeros.
Results suggest that human listeners slightly prefer our generated speech over a smoothed replacement of the inaccurate phoneme with a production of a different speaker.
- Score: 15.565673574838934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning a new language involves constantly comparing speech productions with
reference productions from the environment. Early in speech acquisition,
children make articulatory adjustments to match their caregivers' speech.
Adult learners of a language tweak their speech to match the tutor's reference.
This paper proposes a method to synthetically generate correct pronunciation
feedback given an incorrect production. Furthermore, our aim is to generate the
corrected production while maintaining the speaker's original voice.
The system prompts the user to pronounce a phrase. The speech is recorded,
and the samples associated with the inaccurate phoneme are masked with zeros.
This waveform serves as input to a speech generator, implemented as a deep
learning inpainting system with a U-net architecture and trained to output the
reconstructed speech (sketched below). The training set is composed of unimpaired proper speech
examples, and the generator is trained to reconstruct the original proper
speech. We evaluated the performance of our system on phoneme replacement of
minimal pair words of English as well as on children with pronunciation
disorders. Results suggest that human listeners slightly prefer our generated
speech over a smoothed replacement of the inaccurate phoneme with a production
of a different speaker.
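As a rough illustration of the pipeline described in the abstract, the PyTorch sketch below zeros out the samples aligned to the inaccurate phoneme, converts the waveform to a magnitude spectrogram, and passes it through a small encoder-decoder with a skip connection standing in for the paper's U-net. The sample rate, mask boundaries, STFT settings, and network size are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal mask-and-inpaint sketch (illustrative; not the paper's code).
import torch
import torch.nn as nn

SAMPLE_RATE = 16_000  # assumed sample rate

def mask_phoneme(wave: torch.Tensor, start_s: float, end_s: float) -> torch.Tensor:
    """Zero out the samples aligned to the inaccurate phoneme."""
    masked = wave.clone()
    masked[int(start_s * SAMPLE_RATE):int(end_s * SAMPLE_RATE)] = 0.0
    return masked

class TinyUNet(nn.Module):
    """Two-level encoder-decoder with one skip connection, standing in
    for the paper's U-net spectrogram inpainter."""
    def __init__(self, ch: int = 16):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(1, ch, 3, 2, 1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(ch, 2 * ch, 3, 2, 1), nn.ReLU())
        self.up1 = nn.Sequential(nn.ConvTranspose2d(2 * ch, ch, 4, 2, 1), nn.ReLU())
        self.up2 = nn.ConvTranspose2d(2 * ch, 1, 4, 2, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d1 = self.down1(x)
        d2 = self.down2(d1)
        u1 = self.up1(d2)[..., :d1.shape[-2], :d1.shape[-1]]  # crop to skip size
        out = self.up2(torch.cat([u1, d1], dim=1))            # skip connection
        return out[..., :x.shape[-2], :x.shape[-1]]           # crop to input size

# Mask the bad phoneme, move to a magnitude spectrogram, and inpaint.
wave = torch.randn(SAMPLE_RATE)                    # placeholder 1-second recording
masked = mask_phoneme(wave, start_s=0.30, end_s=0.45)
spec = torch.stft(masked, n_fft=510, hop_length=128,
                  window=torch.hann_window(510), return_complex=True).abs()
reconstructed = TinyUNet()(spec.unsqueeze(0).unsqueeze(0))  # (1, 1, 256, frames)

# Training (per the abstract): mask segments of *proper* speech and minimize a
# reconstruction loss against the original spectrogram, e.g.:
loss = nn.functional.l1_loss(reconstructed, spec.unsqueeze(0).unsqueeze(0))
```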
Related papers
- Jointly Optimizing Translations and Speech Timing to Improve Isochrony
in Automatic Dubbing [71.02335065794384]
We propose a model that directly optimizes both the translation and the speech duration of the generated translations.
We show that this system generates speech that better matches the timing of the original speech, compared to prior work, while simplifying the system architecture.
arXiv Detail & Related papers (2023-02-25T04:23:25Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos [54.08224321456871]
The system is designed to combine multiple component models and produces a video of the original speaker speaking in the target language.
The pipeline starts with automatic speech recognition including emphasis detection, followed by a translation model.
The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model (the cascade is sketched below).
arXiv Detail & Related papers (2022-06-09T14:15:37Z)
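As a rough sketch of such a cascade, the stubs below chain the described stages in order; every function and type here is a hypothetical placeholder, not the actual Face-Dubbing++ API.

```python
# Illustrative cascade only: each stage function is a hypothetical stub.
# The final lip-synchronous video rendering step is omitted.
from dataclasses import dataclass, field

@dataclass
class Utterance:
    text: str
    emphasized: list = field(default_factory=list)  # words flagged as emphasized

def asr(audio_path: str) -> Utterance:
    """Stub: automatic speech recognition with emphasis detection."""
    raise NotImplementedError

def translate(utt: Utterance, target_lang: str) -> Utterance:
    """Stub: translation model that carries emphasis markers along."""
    raise NotImplementedError

def synthesize(utt: Utterance) -> bytes:
    """Stub: text-to-speech in the target language."""
    raise NotImplementedError

def convert_voice(audio: bytes, reference_path: str) -> bytes:
    """Stub: voice conversion back to the original speaker's voice."""
    raise NotImplementedError

def dub(audio_path: str, target_lang: str) -> bytes:
    source = asr(audio_path)                     # 1. recognize + detect emphasis
    translated = translate(source, target_lang)  # 2. translate the transcript
    synthetic = synthesize(translated)           # 3. synthesize target speech
    return convert_voice(synthetic, audio_path)  # 4. restore the speaker's voice
```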
- CorrectSpeech: A Fully Automated System for Speech Correction and Accent Reduction [37.52612296258531]
The proposed system, named CorrectSpeech, performs the correction in three steps.
The quality and naturalness of corrected speech depend on the performance of speech recognition and alignment modules.
The results demonstrate that our system is able to correct mispronunciation and reduce accent in speech recordings.
arXiv Detail & Related papers (2022-04-12T01:20:29Z)
- Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices [15.136348385992047]
We train several voice conversion models using self-supervised speech representations.
Converted voices retain a low word error rate, within 1% of that of the original voice.
Experiments on dysarthric speech data show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices.
arXiv Detail & Related papers (2022-04-04T17:48:01Z)
- DeepFry: Identifying Vocal Fry Using Deep Neural Networks [16.489251286870704]
Vocal fry or creaky voice refers to a voice quality characterized by irregular glottal opening and low pitch.
Due to its irregular periodicity, creaky voice challenges automatic speech processing and recognition systems.
This paper proposes a deep learning model to detect creaky voice in fluent speech.
arXiv Detail & Related papers (2022-03-31T13:23:24Z)
- CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing [67.96138567288197]
This paper proposes a novel end-to-end text-based speech editing method called the context-aware mask prediction network (CampNet).
The model simulates the text-based speech editing process by randomly masking part of the speech and then predicting the masked region from the surrounding context (a masking sketch follows below).
It can fix unnatural prosody in the edited region and synthesize speech corresponding to words unseen in the transcript.
arXiv Detail & Related papers (2022-02-21T02:05:14Z)
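A minimal sketch of the random-masking step such a model trains on, assuming spectrogram-frame masking with zeros; CampNet's actual mask representation and span sizes may differ.

```python
# Random contiguous-span masking for mask-prediction training (illustrative).
import torch

def random_span_mask(frames: torch.Tensor, max_frac: float = 0.15):
    """Mask one random contiguous span of frames; the model is trained to
    predict the span from the unmasked context."""
    T = frames.shape[0]
    span = int(torch.randint(1, max(2, int(T * max_frac) + 1), (1,)))
    start = int(torch.randint(0, T - span + 1, (1,)))
    mask = torch.zeros(T, dtype=torch.bool)
    mask[start:start + span] = True
    masked = frames.clone()
    masked[mask] = 0.0   # zeros stand in for a learned mask embedding
    return masked, mask

frames = torch.randn(120, 80)          # (T, F) placeholder mel-spectrogram
masked, mask = random_span_mask(frames)
# Training objective (schematically): loss = l1(model(masked)[mask], frames[mask])
```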
- Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder (unit extraction is sketched below).
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
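As a hedged illustration of the "discrete content units" idea, the sketch below quantizes frame-level features against a codebook of cluster centroids. Real systems cluster self-supervised features (e.g. from HuBERT); the random tensors and codebook size here are placeholders.

```python
# Deriving discrete content units by nearest-centroid quantization (illustrative).
import torch

def quantize_to_units(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Assign each frame to its nearest codebook entry (k-means style)."""
    dists = torch.cdist(features, codebook)   # (T, K) pairwise distances
    return dists.argmin(dim=1)                # (T,) discrete unit ids

features = torch.randn(200, 256)   # (frames, feature_dim); placeholder features
codebook = torch.randn(100, 256)   # K=100 centroids; size is an assumption
units = quantize_to_units(features, codebook)
# Collapse consecutive repeats, a common step before translating the unit
# sequence to the target emotion and predicting F0 for the vocoder.
keep = torch.cat([torch.tensor([True]), units[1:] != units[:-1]])
deduped = units[keep]
```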
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Translatotron 2: Robust direct speech-to-speech translation [6.3470332633611015]
We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end.
Translatotron 2 outperforms the original Translatotron by a large margin in terms of translation quality and predicted speech naturalness.
We propose a new method for retaining the source speaker's voice in the translated speech.
arXiv Detail & Related papers (2021-07-19T07:43:49Z)