Visual-speech Synthesis of Exaggerated Corrective Feedback
- URL: http://arxiv.org/abs/2009.05748v2
- Date: Tue, 15 Dec 2020 13:16:53 GMT
- Title: Visual-speech Synthesis of Exaggerated Corrective Feedback
- Authors: Yaohua Bu, Weijun Li, Tianyi Ma, Shengqi Chen, Jia Jia, Kun Li, Xiaobo Lu
- Abstract summary: We propose a method for exaggerated visual-speech feedback in computer-assisted pronunciation training (CAPT).
The speech exaggeration is realized by an emphatic speech generation neural network based on Tacotron.
We show that exaggerated feedback outperforms the non-exaggerated version in helping learners with pronunciation identification and pronunciation improvement.
- Score: 32.88905525975493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To provide more discriminative feedback for the second language (L2) learners
to better identify their mispronunciation, we propose a method for exaggerated
visual-speech feedback in computer-assisted pronunciation training (CAPT). The
speech exaggeration is realized by an emphatic speech generation neural network
based on Tacotron, while the visual exaggeration is accomplished by ADC Viseme
Blending, namely increasing Amplitude of movement, extending the phone's
Duration and enhancing the color Contrast. User studies show that exaggerated
feedback outperforms the non-exaggerated version in helping learners with
pronunciation identification and pronunciation improvement.
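
For illustration only, the sketch below mimics the ADC Viseme Blending idea on a sequence of rendered mouth-region frames: it scales movement Amplitude, stretches the phone's Duration, and boosts color Contrast. The frame format, resampling scheme, and gain values are assumptions, not the authors' implementation, and the Tacotron-based speech exaggeration is omitted.

```python
import numpy as np

def adc_exaggerate(viseme_frames, amp_gain=1.5, dur_stretch=1.4, contrast_gain=1.3):
    """Toy ADC-style exaggeration of a viseme sequence.

    viseme_frames: array of shape (T, H, W, 3), float in [0, 1],
                   rendered mouth-region frames for one phone.
    amp_gain:      scales deviation from the neutral (mean) frame,
                   approximating larger articulatory movement.
    dur_stretch:   extends the phone's duration by repeating frames.
    contrast_gain: enhances color contrast around each frame's mean.
    """
    frames = np.asarray(viseme_frames, dtype=np.float32)

    # Amplitude: push each frame further away from the neutral pose.
    neutral = frames.mean(axis=0, keepdims=True)
    frames = neutral + amp_gain * (frames - neutral)

    # Duration: resample the time axis to dur_stretch times its length.
    t_old = np.linspace(0.0, 1.0, num=len(frames))
    t_new = np.linspace(0.0, 1.0, num=int(round(len(frames) * dur_stretch)))
    idx = np.searchsorted(t_old, t_new, side="left").clip(0, len(frames) - 1)
    frames = frames[idx]

    # Contrast: amplify deviation from each frame's mean intensity.
    mean_rgb = frames.mean(axis=(1, 2), keepdims=True)
    frames = mean_rgb + contrast_gain * (frames - mean_rgb)

    return frames.clip(0.0, 1.0)

# Example: exaggerate 10 random 64x64 "frames" for a single phone.
demo = np.random.rand(10, 64, 64, 3)
print(adc_exaggerate(demo).shape)  # (14, 64, 64, 3)
```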
Related papers
- VALLR: Visual ASR Language Model for Lip Reading [28.561566996686484]
Lip Reading, or Visual Automatic Speech Recognition, is a complex task requiring the interpretation of spoken language exclusively from visual cues.
We propose a novel two-stage, phoneme-centric framework for Visual Automatic Speech Recognition (V-ASR).
First, our model predicts a compact sequence of phonemes from visual inputs using a Video Transformer with a CTC head.
This phoneme output then serves as the input to a fine-tuned Large Language Model (LLM), which reconstructs coherent words and sentences.
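
A minimal sketch of the two-stage idea, assuming a CTC-trained video encoder and a text-only second stage; the phoneme inventory and prompt template below are hypothetical, not the paper's.

```python
import torch

# Hypothetical phoneme inventory; the paper's actual vocabulary differs.
PHONEMES = ["<blank>", "AA", "B", "IY", "K", "T", "S"]

def ctc_greedy_phonemes(logits):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.

    logits: (T, V) tensor of per-frame phoneme scores from a video encoder.
    """
    ids = logits.argmax(dim=-1).tolist()
    out, prev = [], None
    for i in ids:
        if i != prev and i != 0:          # index 0 is the CTC blank
            out.append(PHONEMES[i])
        prev = i
    return out

def phonemes_to_llm_prompt(phonemes):
    """Second stage (sketch): hand the phoneme string to a fine-tuned LLM,
    which would reconstruct words and sentences."""
    return "Reconstruct the sentence from these phonemes: " + " ".join(phonemes)

# Example with random stand-in encoder outputs for 12 frames.
logits = torch.randn(12, len(PHONEMES))
print(phonemes_to_llm_prompt(ctc_greedy_phonemes(logits)))
```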
arXiv Detail & Related papers (2025-03-27T11:52:08Z)
- Enhancing nonnative speech perception and production through an AI-powered application [0.0]
The aim of this study is to examine the impact of training with an AI-powered mobile application on nonnative sound perception and production.
The intervention involved training with the Speakometer mobile application, which incorporated recording tasks featuring the English vowels, along with pronunciation feedback and practice.
The results revealed significant improvements in both discrimination accuracy and production of the target contrast following the intervention.
arXiv Detail & Related papers (2025-03-18T10:05:12Z)
- Speech Recognition With LLMs Adapted to Disordered Speech Using Reinforcement Learning [13.113505050543298]
We introduce a large language model capable of processing speech inputs.
We show that tuning it further with reinforcement learning on human preference enables it to adapt better to disordered speech than traditional fine-tuning.
arXiv Detail & Related papers (2024-12-25T00:16:22Z)
- Speech2rtMRI: Speech-Guided Diffusion Model for Real-time MRI Video of the Vocal Tract during Speech [29.510756530126837]
We introduce a data-driven method to visually represent articulator motion in MRI videos of the human vocal tract during speech.
We leverage large pre-trained speech models, which are embedded with prior knowledge, to generalize the visual domain to unseen data.
arXiv Detail & Related papers (2024-09-23T20:19:24Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
This innovative model surpasses the performance of previous unsupervised ASR models under the lexicon-free setting.
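
As a rough illustration of the masked token-infilling objective (not the paper's exact recipe, which applies it jointly to speech and text units), the snippet below masks random spans of discrete units and exposes the masked positions as reconstruction targets.

```python
import random
import torch

MASK_ID = 0

def mask_spans(tokens, mask_prob=0.3, max_span=3):
    """Toy span masking for token infilling: a model would be trained to
    recover the masked positions from the surrounding context."""
    tokens = tokens.clone()
    targets = torch.full_like(tokens, -100)   # ignore index for the loss
    i = 0
    while i < len(tokens):
        if random.random() < mask_prob:
            j = min(i + random.randint(1, max_span), len(tokens))
            targets[i:j] = tokens[i:j]
            tokens[i:j] = MASK_ID
            i = j
        else:
            i += 1
    return tokens, targets

units = torch.randint(1, 100, (32,))          # discrete speech (or text) units
masked, targets = mask_spans(units)
print(masked.tolist())
```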
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy based on processing with visual speech units.
We set a new state of the art in multilingual VSR by achieving performance comparable to previous language-specific VSR models.
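
One common way to obtain such discrete visual speech units is to cluster features from a pre-trained visual encoder; the sketch below assumes k-means with an arbitrary codebook size and random placeholder features, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder visual speech features, e.g. from a pre-trained visual encoder:
# shape (num_frames, feature_dim).
features = np.random.randn(5000, 256).astype(np.float32)

# Learn a unit inventory by clustering, then map each frame to a discrete unit.
kmeans = KMeans(n_clusters=200, n_init=4, random_state=0).fit(features)
units = kmeans.predict(features)   # (num_frames,) integer "visual speech units"

# The discretized units can then serve as pre-training targets in place of
# raw video frames.
print(units[:20])
```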
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition and Phoneme to Grapheme Translation [9.118302330129284]
This research optimizes two-pass cross-lingual transfer learning for low-resource languages.
We optimize phoneme vocabulary coverage by merging phonemes based on shared articulatory characteristics.
We introduce a global phoneme noise generator for realistic ASR noise during phoneme-to-grapheme training to reduce error propagation.
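
A toy sketch of both ideas, with a hypothetical merge table and substitution-only noise; real merge rules and noise statistics would be derived from articulatory features and observed ASR errors rather than chosen by hand.

```python
import random

# Hypothetical merge table: collapse phonemes that share articulatory
# characteristics so a low-resource language can reuse existing phonemes.
MERGE = {"DX": "D", "NX": "N", "AX": "AH", "IX": "IH"}

def merge_phonemes(seq):
    return [MERGE.get(p, p) for p in seq]

def add_phoneme_noise(seq, sub_prob=0.1, vocab=("AH", "IH", "D", "N", "S", "T")):
    """Toy 'phoneme noise generator': randomly substitute phonemes so the
    phoneme-to-grapheme model sees ASR-like errors during training."""
    return [random.choice(vocab) if random.random() < sub_prob else p for p in seq]

seq = ["DX", "AX", "N", "IX", "S"]
print(add_phoneme_noise(merge_phonemes(seq)))
```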
arXiv Detail & Related papers (2023-12-06T06:37:24Z)
- Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training [102.18680666349806]
We propose a speed co-augmentation method that randomly changes the playback speeds of both audio and video data.
Experimental results show that the proposed method significantly improves the learned representations when compared to vanilla audio-visual contrastive learning.
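
A minimal sketch of the co-augmentation, assuming paired tensors and a hand-picked set of candidate speeds; the contrastive objective that consumes the augmented views is omitted.

```python
import random
import torch
import torch.nn.functional as F

def speed_co_augment(audio, video, speeds=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Randomly change the playback speed of a paired audio-video clip.

    audio: (channels, samples); video: (frames, C, H, W).
    The candidate speed set is an assumption, not the authors' configuration.
    """
    s = random.choice(speeds)

    # Resample audio along time by factor 1/s (faster speed -> fewer samples).
    new_len = max(1, int(audio.shape[-1] / s))
    audio_aug = F.interpolate(audio.unsqueeze(0), size=new_len,
                              mode="linear", align_corners=False).squeeze(0)

    # Subsample / repeat video frames to match the same speed factor.
    new_frames = max(1, int(video.shape[0] / s))
    idx = torch.linspace(0, video.shape[0] - 1, new_frames).round().long()
    video_aug = video[idx]

    return audio_aug, video_aug, s

audio = torch.randn(1, 16000)          # 1 s of 16 kHz audio
video = torch.randn(25, 3, 96, 96)     # 25 frames
a, v, s = speed_co_augment(audio, video)
print(s, a.shape, v.shape)
```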
arXiv Detail & Related papers (2023-09-25T08:22:30Z)
- Personalized Speech Enhancement: New Models and Comprehensive Evaluation [27.572537325449158]
We propose two neural networks for personalized speech enhancement (PSE) that achieve superior performance to the previously proposed VoiceFilter.
We also create test sets that capture a variety of scenarios that users can encounter during video conferencing.
Our results show that the proposed models can yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models.
arXiv Detail & Related papers (2021-10-18T21:21:23Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
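
The core "switch" can be sketched as swapping the quantized targets between the clean and noisy views; the loss below is a toy stand-in for the actual wav2vec 2.0 contrastive objective, which also involves masking, negatives, and a diversity term.

```python
import torch

def switch_targets(quantized_clean, quantized_noisy):
    """Each view additionally predicts the *other* view's quantized
    representations, encouraging noise-invariant contextual features.
    Tensors: (batch, time, dim)."""
    return quantized_noisy, quantized_clean   # targets for clean / noisy branch

def cosine_prediction_loss(context, targets):
    """Toy stand-in for the contrastive loss: pull contextual features toward
    their (switched) quantized targets."""
    return 1.0 - torch.nn.functional.cosine_similarity(context, targets, dim=-1).mean()

q_clean, q_noisy = torch.randn(2, 50, 256), torch.randn(2, 50, 256)
c_clean, c_noisy = torch.randn(2, 50, 256), torch.randn(2, 50, 256)
t_clean, t_noisy = switch_targets(q_clean, q_noisy)
loss = cosine_prediction_loss(c_clean, t_clean) + cosine_prediction_loss(c_noisy, t_noisy)
print(loss.item())
```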
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train the model to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z)
- UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data [54.733889961024445]
We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data.
We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus.
arXiv Detail & Related papers (2021-01-19T12:53:43Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
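
A skeletal sketch of that pipeline, with module sizes, layer choices, and feature dimensions as assumptions: a prosody corrector predicts typical per-phoneme duration and pitch from phoneme embeddings, and a conversion model decodes acoustic features from phonemes plus the corrected prosody (a vocoder would follow in practice).

```python
import torch
import torch.nn as nn

class ProsodyCorrector(nn.Module):
    """Maps phoneme embeddings to 'typical' per-phoneme duration and pitch."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.GRU(dim, dim, batch_first=True)
        self.duration = nn.Linear(dim, 1)
        self.pitch = nn.Linear(dim, 1)

    def forward(self, phoneme_emb):                   # (B, N, dim)
        h, _ = self.net(phoneme_emb)
        return self.duration(h), self.pitch(h)        # (B, N, 1) each

class ConversionModel(nn.Module):
    """Consumes phoneme embeddings plus typical prosody and predicts
    acoustic features (e.g. mel frames)."""
    def __init__(self, dim=128, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(dim + 2, dim)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_mels)

    def forward(self, phoneme_emb, duration, pitch):
        x = torch.cat([phoneme_emb, duration, pitch], dim=-1)
        h, _ = self.decoder(torch.relu(self.proj(x)))
        return self.out(h)                             # (B, N, n_mels)

phonemes = torch.randn(1, 20, 128)                     # 20 phoneme embeddings
corrector, converter = ProsodyCorrector(), ConversionModel()
dur, f0 = corrector(phonemes)
mel = converter(phonemes, dur, f0)
print(dur.shape, f0.shape, mel.shape)
```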
arXiv Detail & Related papers (2020-11-03T13:08:53Z)