Super-Human Performance in Online Low-latency Recognition of
Conversational Speech
- URL: http://arxiv.org/abs/2010.03449v5
- Date: Mon, 26 Jul 2021 20:56:49 GMT
- Title: Super-Human Performance in Online Low-latency Recognition of
Conversational Speech
- Authors: Thai-Son Nguyen, Sebastian Stueker, Alex Waibel
- Abstract summary: We present results for a system that can achieve super-human performance at a word-based latency of only 1 second behind a speaker's speech.
The system uses multiple attention-based encoder-decoder networks integrated within a novel low-latency incremental inference approach.
- Score: 18.637636841477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving super-human performance in recognizing human speech has been a goal
for several decades, as researchers have worked on increasingly challenging
tasks. In the 1990s it was discovered that conversational speech between two
humans is considerably more difficult than read speech, as hesitations,
disfluencies, false starts and sloppy articulation complicate acoustic
processing and require robust, joint handling of acoustic, lexical and
language context. Early attempts with statistical models could only reach
error rates of over 50%, far from human performance (a WER of around 5.5%).
Neural hybrid models and recent attention-based encoder-decoder models have
considerably improved performance as such contexts can now be learned in an
integral fashion. However, processing such contexts requires the presentation
of an entire utterance and thus introduces unwanted delays before a recognition
result can be output. In this paper, we address performance as well as latency.
We present results for a system that can achieve super-human performance (a
WER of 5.0% on the Switchboard conversational benchmark) at a word-based
latency of only 1 second behind a speaker's speech. The system uses multiple
attention-based encoder-decoder networks integrated within a novel low-latency
incremental inference approach.
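The abstract describes the incremental inference idea only at a high level. As a rough illustration, here is a minimal Python sketch of one common low-latency decoding pattern consistent with that description: repeatedly re-decode a growing audio prefix with the full encoder-decoder model and commit only the hypothesis prefix that agrees across consecutive passes. The function names (`recognize`, `incremental_decode`) and the stability rule are illustrative assumptions, not the authors' actual algorithm.

```python
# Minimal sketch of incremental inference with hypothesis stabilization.
# Assumption: `recognize(audio_prefix)` is a placeholder for a full
# attention-based encoder-decoder pass; it is NOT the authors' actual API.

def shared_prefix(a: list[str], b: list[str]) -> list[str]:
    """Longest common prefix of two token sequences."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out

def incremental_decode(audio_chunks, recognize):
    """Re-decode a growing audio prefix; emit only tokens that agree
    across two consecutive passes, so later context cannot revise them."""
    committed: list[str] = []
    prev_hyp: list[str] = []
    buffered = []
    for chunk in audio_chunks:
        buffered.extend(chunk)
        hyp = recognize(buffered)          # full pass over the prefix so far
        stable = shared_prefix(prev_hyp, hyp)
        new_tokens = stable[len(committed):]  # newly stabilized tokens only
        if new_tokens:
            yield new_tokens
            committed = stable
        prev_hyp = hyp
```

Under a scheme like this, committed words are never retracted, and the word-based latency is governed by the chunk size plus however long the hypothesis prefix takes to stabilize.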
Related papers
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
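As a hedged illustration of the cross-attention fusion this summary mentions, the PyTorch sketch below lets linguistic states attend over acoustic frames and regresses the time remaining until the end of utterance (EOU). All dimensions, module names, and the regression target are assumptions, not the paper's architecture.

```python
# Hypothetical sketch of fusing acoustic and linguistic streams with
# cross-attention for end-of-utterance (EOU) prediction.
import torch
import torch.nn as nn

class EOUPredictor(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Linguistic states (queries) attend over acoustic frames (keys/values).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.eou_head = nn.Linear(d_model, 1)  # seconds remaining until EOU

    def forward(self, text_states, acoustic_frames):
        fused, _ = self.cross_attn(text_states, acoustic_frames, acoustic_frames)
        # Predict remaining time from the most recent linguistic state.
        return self.eou_head(fused[:, -1])

# Usage with dummy tensors: batch of 2, 10 text states, 100 acoustic frames.
model = EOUPredictor()
t = torch.randn(2, 10, 256)
a = torch.randn(2, 100, 256)
print(model(t, a).shape)  # torch.Size([2, 1])
```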
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose removing the reliance on a phoneme lexicon in order to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
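A toy sketch of the masked token-infilling objective mentioned above, applied to discretized speech units; the mask rate, token names, and quantization are placeholders rather than the paper's actual setup.

```python
# Illustrative sketch of masked token-infilling on discretized speech units.
import random

MASK = "<mask>"

def mask_tokens(tokens: list[str], rate: float = 0.15, seed: int = 0):
    """Replace a random subset of tokens with <mask>; return inputs + targets."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < rate:
            inputs.append(MASK)
            targets.append(tok)      # model must infill the original token
        else:
            inputs.append(tok)
            targets.append("<pad>")  # no loss on unmasked positions
    return inputs, targets

# The same objective can be applied to speech-unit sequences and text
# sequences, so one model learns both modalities without a phoneme lexicon.
units = ["u12", "u7", "u33", "u7", "u91"]  # hypothetical discrete speech units
print(mask_tokens(units))
```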
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- Child Speech Recognition in Human-Robot Interaction: Problem Solved? [0.024739484546803334]
Recent evolutions in data-driven speech recognition might mean a breakthrough for child speech recognition and social robot applications aimed at children.
We revisit a study on child speech recognition from 2017 and show that indeed performance has increased.
While transcription is not yet perfect, the best model recognises 60.3% of sentences correctly, barring small grammatical differences.
arXiv Detail & Related papers (2024-04-26T13:14:28Z)
- FlashSpeech: Efficient Zero-Shot Speech Synthesis [37.883762387219676]
FlashSpeech is a large-scale zero-shot speech synthesis system that requires approximately 5% of the inference time of previous work.
We show that FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity.
arXiv Detail & Related papers (2024-04-23T02:57:46Z)
- SpeechAlign: Aligning Speech Generation to Human Preferences [51.684183257809075]
We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences.
We show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model.
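A schematic of what an iterative preference-alignment loop of this kind might look like; `generate`, `score`, and `optimize_on_pairs` are hypothetical stand-ins, and this is not SpeechAlign's actual procedure.

```python
# Schematic of iterative self-improvement from preference pairs.
def align_iteratively(model, prompts, score, optimize_on_pairs, rounds=3):
    for _ in range(rounds):
        pairs = []
        for p in prompts:
            a, b = model.generate(p), model.generate(p)
            # The preferred sample becomes the "chosen" side of the pair.
            chosen, rejected = (a, b) if score(a) >= score(b) else (b, a)
            pairs.append((p, chosen, rejected))
        # Update the model on preference pairs, then regenerate with the
        # improved model in the next round (continuous self-improvement).
        model = optimize_on_pairs(model, pairs)
    return model
```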
arXiv Detail & Related papers (2024-04-08T15:21:17Z)
- Disentangled Feature Learning for Real-Time Neural Speech Coding [24.751813940000993]
In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding.
We find that the learned disentangled features achieve performance on any-to-any voice conversion comparable to modern self-supervised speech representation learning models.
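As a toy illustration of why disentangled features enable any-to-any voice conversion: content features from a source utterance can be recombined with speaker features from a target utterance. The encoder and decoder arguments below are placeholders for the paper's learned modules.

```python
# Toy illustration of voice conversion via disentangled features.
def convert(src_speech, tgt_speech, content_enc, speaker_enc, decoder):
    content = content_enc(src_speech)   # what is said (from the source)
    speaker = speaker_enc(tgt_speech)   # who says it (target voice)
    return decoder(content, speaker)    # target voice speaking source content
```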
arXiv Detail & Related papers (2022-11-22T02:50:12Z)
- Self-Supervised Learning for Speech Enhancement through Synthesis [5.924928860260821]
We propose a denoising vocoder (DeVo) approach, where a vocoder accepts noisy representations and learns to directly synthesize clean speech.
We demonstrate a causal version capable of running on streaming audio with 10ms latency and minimal performance degradation.
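A minimal sketch of causal streaming at a 10 ms hop, the latency regime the summary cites; the `enhance_frame` callable stands in for the DeVo vocoder and is an assumption.

```python
# Minimal sketch of causal, frame-by-frame streaming at 10 ms hops.
import numpy as np

SAMPLE_RATE = 16_000
HOP = SAMPLE_RATE // 100          # 10 ms -> 160 samples per frame

def stream_enhance(audio: np.ndarray, enhance_frame, state=None):
    """Emit enhanced audio one 10 ms frame at a time, using only past
    context carried in `state`, so algorithmic latency stays at one hop."""
    out = []
    for start in range(0, len(audio) - HOP + 1, HOP):
        frame = audio[start:start + HOP]
        clean, state = enhance_frame(frame, state)  # causal: no lookahead
        out.append(clean)
    return np.concatenate(out) if out else np.zeros(0)
```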
arXiv Detail & Related papers (2022-11-04T16:06:56Z)
- SepIt: Approaching a Single Channel Speech Separation Bound [99.19786288094596]
We introduce a deep neural network, SepIt, that iteratively improves the estimates of the different speakers.
In an extensive set of experiments, SepIt outperforms the state-of-the-art neural networks for 2, 3, 5, and 10 speakers.
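A schematic of iterative refinement over speaker estimates in the spirit of this summary; the `refine` callable stands in for SepIt's network, and the initialization is an assumption.

```python
# Schematic of iteratively refining per-speaker source estimates.
def separate_iteratively(mixture, refine, n_speakers=2, n_iters=4):
    # Start from a trivial split of the mixture among the speakers.
    estimates = [mixture / n_speakers for _ in range(n_speakers)]
    for _ in range(n_iters):
        # Each pass conditions on the mixture and the current estimates
        # and returns improved per-speaker estimates.
        estimates = refine(mixture, estimates)
    return estimates
```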
arXiv Detail & Related papers (2022-05-24T05:40:36Z)
- Recent Progress in the CUHK Dysarthric Speech Recognition System [66.69024814159447]
Disordered speech presents a wide spectrum of challenges to current data intensive deep neural networks (DNNs) based automatic speech recognition technologies.
This paper presents recent research efforts at the Chinese University of Hong Kong to improve the performance of disordered speech recognition systems.
arXiv Detail & Related papers (2022-01-15T13:02:40Z)
- Accented Speech Recognition Inspired by Human Perception [0.0]
This paper explores methods that are inspired by human perception to evaluate possible performance improvements for recognition of accented speech.
We explore four methodologies: pre-exposure to multiple accents, grapheme and phoneme-based pronunciations, dropout, and the identification of the layers in the neural network that can specifically be associated with accent modeling.
Our results indicate that methods based on human perception are promising in reducing WER and understanding how accented speech is modeled in neural networks for novel accents.
arXiv Detail & Related papers (2021-04-09T22:35:09Z)
- Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition [66.47000813920619]
We propose a non-autoregressive end-to-end speech recognition system called LASO.
Because of its non-autoregressive property, LASO predicts each textual token in the sequence without depending on the other tokens.
We conduct experiments on the publicly available Chinese dataset AISHELL-1.
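To make the latency argument concrete, the sketch below contrasts autoregressive decoding (one forward step per emitted token) with non-autoregressive decoding (all positions predicted in a single pass, with no token-to-token dependence, as the summary describes). The `model.step` and `model.predict_all` calls are hypothetical stand-ins, not LASO's actual interface.

```python
# Contrast sketch: autoregressive vs. non-autoregressive decoding.
import torch

def autoregressive_decode(model, enc_out, max_len, bos, eos):
    """Sequential: each token depends on the previously emitted tokens."""
    tokens = [bos]
    for _ in range(max_len):
        logits = model.step(enc_out, torch.tensor([tokens]))
        nxt = int(logits[0, -1].argmax())
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens[1:]

def non_autoregressive_decode(model, enc_out, max_len):
    """Parallel: all positions predicted at once, with no token-to-token
    dependence, so latency is one forward pass instead of max_len passes."""
    logits = model.predict_all(enc_out, max_len)   # (1, max_len, vocab)
    return logits[0].argmax(dim=-1).tolist()
```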
arXiv Detail & Related papers (2020-05-11T04:45:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.