Accented Speech Recognition Inspired by Human Perception
- URL: http://arxiv.org/abs/2104.04627v1
- Date: Fri, 9 Apr 2021 22:35:09 GMT
- Title: Accented Speech Recognition Inspired by Human Perception
- Authors: Xiangyun Chu (1), Elizabeth Combs (1), Amber Wang (1), Michael Picheny
(2) ((1) Center for Data Science, New York University, (2) Courant Computer
Science and Center for Data Science, New York University)
- Abstract summary: This paper explores methods that are inspired by human perception to evaluate possible performance improvements for recognition of accented speech.
We explore four methodologies: pre-exposure to multiple accents, grapheme and phoneme-based pronunciations, dropout, and the identification of the layers in the neural network that can specifically be associated with accent modeling.
Our results indicate that methods based on human perception are promising in reducing WER and understanding how accented speech is modeled in neural networks for novel accents.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While improvements have been made in automatic speech recognition performance
over the last several years, machines continue to have significantly lower
performance on accented speech than humans. In addition, the most significant
improvements on accented speech primarily arise by overwhelming the problem
with hundreds or even thousands of hours of data. Humans typically require much
less data to adapt to a new accent. This paper explores methods that are
inspired by human perception to evaluate possible performance improvements for
recognition of accented speech, with a specific focus on recognizing speech
with a novel accent relative to that of the training data. Our experiments are
run on small, accessible datasets that are available to the research community.
We explore four methodologies: pre-exposure to multiple accents, grapheme and
phoneme-based pronunciations, dropout (to improve generalization to a novel
accent), and the identification of the layers in the neural network that can
specifically be associated with accent modeling. Our results indicate that
methods based on human perception are promising in reducing WER and
understanding how accented speech is modeled in neural networks for novel
accents.
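As a concrete reading of the fourth methodology above, the hypothetical PyTorch sketch below (not the authors' code; model architecture, layer names, and shapes are all assumptions) fine-tunes one encoder layer at a time on accented data, keeping dropout active for generalization; the layer whose adaptation most reduces WER is a candidate accent-specific layer.
```python
# Hypothetical sketch: probe which encoder layer is most associated with
# accent modeling by unfreezing one layer at a time and fine-tuning it on
# the novel accent. Dropout (one of the four methodologies) stays enabled.
import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    def __init__(self, n_feats=80, n_hidden=256, n_tokens=32, p_drop=0.3):
        super().__init__()
        # dropout between stacked LSTM layers aids generalization
        self.encoder = nn.LSTM(n_feats, n_hidden, num_layers=4,
                               batch_first=True, dropout=p_drop)
        self.classifier = nn.Linear(n_hidden, n_tokens)

    def forward(self, x):                  # x: (batch, frames, n_feats)
        h, _ = self.encoder(x)
        return self.classifier(h)          # per-frame token logits

def unfreeze_only(model: nn.Module, layer_tag: str):
    """Make only parameters whose names contain layer_tag trainable."""
    for name, p in model.named_parameters():
        p.requires_grad = layer_tag in name

model = ToyAcousticModel()
unfreeze_only(model, "_l0")  # adapt only LSTM layer 0 to the novel accent
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
# Repeat with "_l1" ... "_l3" and compare dev-set WER across runs.
```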
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances at inference time, did not preserve speaker identity well, or used one-to-one systems that must be trained separately for each non-native accent.
This paper presents a promising AC model that can convert many accents into native speech to overcome these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- Accented Speech Recognition With Accent-specific Codebooks [53.288874858671576]
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems.
Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR.
We propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks.
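As a rough illustration of the codebook idea (a sketch under assumed dimensions, not the paper's released implementation), encoder states can cross-attend over a small set of trainable accent vectors:
```python
# Sketch of cross-attention over a trainable accent codebook: every
# utterance attends to the same learned set of accent vectors.
import torch
import torch.nn as nn

class CodebookAttention(nn.Module):
    def __init__(self, d_model=256, n_entries=64):
        super().__init__()
        # Learnable codebook: n_entries vectors acting as keys/values.
        self.codebook = nn.Parameter(torch.randn(n_entries, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=4,
                                          batch_first=True)

    def forward(self, h):
        # h: (batch, time, d_model) encoder states serve as queries
        kv = self.codebook.unsqueeze(0).expand(h.size(0), -1, -1)
        accent_info, _ = self.attn(query=h, key=kv, value=kv)
        return h + accent_info  # residual fusion into the encoder stream

h = torch.randn(2, 50, 256)
print(CodebookAttention()(h).shape)  # torch.Size([2, 50, 256])
```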
arXiv Detail & Related papers (2023-10-24T16:10:58Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
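For the contrastive category, a toy InfoNCE-style loss (illustrative only; not any specific published model) can be written as:
```python
# Toy InfoNCE-style contrastive loss: each anchor must identify its own
# positive among all other in-batch rows, which act as negatives.
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.1):
    # anchors, positives: (batch, dim); positives[i] pairs with anchors[i]
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature    # (batch, batch) similarity matrix
    labels = torch.arange(a.size(0))    # the diagonal is the positive pair
    return F.cross_entropy(logits, labels)

z1 = torch.randn(8, 128)   # e.g., context-network outputs
z2 = torch.randn(8, 128)   # e.g., local target features
print(info_nce(z1, z2))
```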
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
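A minimal sketch of this re-synthesis pipeline (module names, shapes, and the code vocabulary size are assumptions, not the paper's architecture): predict per-frame discrete codec codes from audio-visual features, then hand the codes to the codec's decoder.
```python
# Sketch: classify one discrete codec code per frame from AV features;
# a codec decoder (not shown) would re-synthesize clean speech from them.
import torch
import torch.nn as nn

class AVToCodes(nn.Module):
    def __init__(self, d_av=512, n_codes=1024):
        super().__init__()
        self.backbone = nn.GRU(d_av, 512, batch_first=True)
        self.head = nn.Linear(512, n_codes)

    def forward(self, av_feats):           # av_feats: (batch, frames, d_av)
        h, _ = self.backbone(av_feats)
        return self.head(h)                # (batch, frames, n_codes) logits

logits = AVToCodes()(torch.randn(1, 100, 512))
codes = logits.argmax(-1)  # discrete codes for the codec decoder
```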
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Analysis of French Phonetic Idiosyncrasies for Accent Recognition [0.8602553195689513]
Differences in pronunciation, accent, and intonation are among the most common sources of error in speech recognition.
In this paper, we focus our attention on the French accent.
We use traditional machine learning techniques and convolutional neural networks, and show that the classical techniques are not efficient enough to solve this problem.
We also identify the limitations of this approach by studying the impact of French idiosyncrasies on spectrograms.
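A minimal convolutional classifier over log-mel spectrograms, sketching the kind of CNN such accent-recognition work compares against classical techniques (architecture details here are assumed, not taken from the paper):
```python
# Tiny CNN accent classifier over log-mel spectrograms.
import torch
import torch.nn as nn

accent_cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2),  # e.g., French-accented vs. other
)
spec = torch.randn(4, 1, 80, 300)  # (batch, channel, mel bins, frames)
print(accent_cnn(spec).shape)      # torch.Size([4, 2])
```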
arXiv Detail & Related papers (2021-10-18T10:50:50Z)
- Deep Discriminative Feature Learning for Accent Recognition [14.024346215923972]
We adopt a Convolutional Recurrent Neural Network as the front-end encoder and integrate local features with a Recurrent Neural Network to build an utterance-level accent representation.
We show that our proposed network, trained with a discriminative objective, significantly outperforms the baseline system on the accent classification track of the Accented English Speech Recognition Challenge 2020.
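In outline (a simplified sketch with illustrative dimensions, not the paper's exact network), the CRNN front end plus recurrent aggregation can look like:
```python
# Sketch: convolutions extract local spectro-temporal patterns, a GRU
# aggregates them over time, and its final hidden state serves as the
# utterance-level accent embedding.
import torch
import torch.nn as nn

class CRNNAccentEncoder(nn.Module):
    def __init__(self, n_mels=80, d_hidden=128, n_accents=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.rnn = nn.GRU(32 * (n_mels // 2), d_hidden, batch_first=True)
        self.out = nn.Linear(d_hidden, n_accents)

    def forward(self, spec):                  # spec: (B, 1, n_mels, T)
        f = self.conv(spec)                   # (B, 32, n_mels/2, T/2)
        f = f.permute(0, 3, 1, 2).flatten(2)  # (B, T/2, 32 * n_mels/2)
        _, h = self.rnn(f)                    # h: (1, B, d_hidden)
        return self.out(h[-1])                # utterance-level accent logits

print(CRNNAccentEncoder()(torch.randn(2, 1, 80, 200)).shape)  # (2, 8)
```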
arXiv Detail & Related papers (2020-11-25T00:46:47Z)
- Super-Human Performance in Online Low-latency Recognition of Conversational Speech [18.637636841477]
We present results for a system that can achieve super-human performance at a word-based latency of only 1 second behind a speaker's speech.
The system uses multiple attention-based encoder-decoder networks integrated within a novel low latency incremental inference approach.
arXiv Detail & Related papers (2020-10-07T14:41:32Z)
- Knowing What to Listen to: Early Attention for Deep Speech Representation Learning [25.71206255965502]
We propose the novel Fine-grained Early Attention (FEFA) for speech signals.
This model is capable of focusing on information items as small as frequency bins.
We evaluate the proposed model on two popular tasks of speaker recognition and speech emotion recognition.
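A toy version of attention at the granularity of individual frequency bins (inspired by, but much simpler than, the FEFA model described above):
```python
# Sketch: compute per-frame weights for each frequency bin and re-scale
# the spectrogram before any deeper processing ("early" attention).
import torch
import torch.nn as nn

class FrequencyBinAttention(nn.Module):
    def __init__(self, n_bins=80):
        super().__init__()
        self.score = nn.Linear(n_bins, n_bins)  # one score per bin

    def forward(self, spec):
        # spec: (batch, time, n_bins)
        weights = torch.sigmoid(self.score(spec))
        return spec * weights

spec = torch.randn(4, 120, 80)
print(FrequencyBinAttention()(spec).shape)  # torch.Size([4, 120, 80])
```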
arXiv Detail & Related papers (2020-09-03T17:40:27Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled with signal processing and machine learning techniques; more recently, deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
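Both the viseme error rate reported here and the WER discussed in the main paper reduce to a normalized edit distance over token sequences; a small reference implementation:
```python
# Error rate = Levenshtein edit distance / reference length.
def error_rate(ref, hyp):
    """Tokens are list elements (words for WER, visemes for VER)."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

print(error_rate("the cat sat".split(), "the cat sit".split()))  # 0.333...
```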
arXiv Detail & Related papers (2020-06-12T06:51:55Z)