Scaling Up Online Speech Recognition Using ConvNets
- URL: http://arxiv.org/abs/2001.09727v1
- Date: Mon, 27 Jan 2020 12:55:02 GMT
- Title: Scaling Up Online Speech Recognition Using ConvNets
- Authors: Vineel Pratap, Qiantong Xu, Jacob Kahn, Gilad Avidov, Tatiana
Likhomanenko, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan
Collobert
- Abstract summary: We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC).
We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy.
The system has almost three times the throughput of a well tuned hybrid ASR baseline while also having lower latency and a better word error rate.
- Score: 33.75588539732141
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We design an online end-to-end speech recognition system based on Time-Depth
Separable (TDS) convolutions and Connectionist Temporal Classification (CTC).
We improve the core TDS architecture in order to limit the future context and
hence reduce latency while maintaining accuracy. The system has almost three
times the throughput of a well tuned hybrid ASR baseline while also having
lower latency and a better word error rate. Also important to the efficiency of
the recognizer is our highly optimized beam search decoder. To show the impact
of our design choices, we analyze throughput, latency, accuracy, and discuss
how these metrics can be tuned based on the user requirements.
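To make the latency-limiting idea concrete, below is a minimal PyTorch sketch of a TDS-style block whose temporal convolution is padded asymmetrically so that each output frame sees only a bounded number of future frames. The class name, layer sizes, and the future_frames parameter are illustrative assumptions; the authors' actual implementation is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowLatencyTDSBlock(nn.Module):
    """Sketch of a TDS-style block: temporal 2-D conv + pointwise MLP, both residual.
    Illustrative only; not the paper's code."""

    def __init__(self, channels: int, width: int, kernel_size: int, future_frames: int):
        super().__init__()
        assert 0 <= future_frames < kernel_size
        # Asymmetric padding: most of the receptive field covers the past;
        # only `future_frames` positions look ahead, which bounds lookahead latency.
        self.left_pad = kernel_size - 1 - future_frames
        self.right_pad = future_frames
        self.conv = nn.Conv2d(channels, channels, kernel_size=(kernel_size, 1))
        self.norm1 = nn.LayerNorm(channels * width)
        self.mlp = nn.Sequential(
            nn.Linear(channels * width, channels * width),
            nn.ReLU(),
            nn.Linear(channels * width, channels * width),
        )
        self.norm2 = nn.LayerNorm(channels * width)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, width)
        b, c, t, w = x.shape
        padded = F.pad(x, (0, 0, self.left_pad, self.right_pad))  # pad only the time axis
        y = torch.relu(self.conv(padded)) + x                     # temporal sub-block + residual
        y = y.permute(0, 2, 1, 3).reshape(b, t, c * w)            # (batch, time, channels*width)
        y = self.norm1(y)
        y = self.norm2(self.mlp(y) + y)                           # fully connected sub-block + residual
        return y.reshape(b, t, c, w).permute(0, 2, 1, 3)
```

For example, with kernel_size=11 and future_frames=2, each output frame of the block depends on at most 2 future input frames, so the per-block lookahead is bounded by 2 frames times the frame stride; stacking blocks adds these lookaheads up, which is how the future context (and hence latency) of the whole encoder is controlled.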
Related papers
- An Efficient and Streaming Audio Visual Active Speaker Detection System [2.4515389321702132]
We present two scenarios that address the key challenges posed by real-time constraints.
First, we introduce a method to limit the number of future context frames utilized by the ASD model.
Second, we propose a more stringent constraint that limits the total number of past frames the model can access during inference.
arXiv Detail & Related papers (2024-09-13T17:45:53Z) - Systematic Evaluation of Online Speaker Diarization Systems Regarding their Latency [44.99833362998488]
The latency is the time span from audio input to the output of the corresponding speaker label.
The lowest latency is achieved for the DIART-pipeline with the embedding model pyannote/embedding.
The FS-EEND system shows a similarly good latency.
arXiv Detail & Related papers (2024-07-05T06:54:27Z) - Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities enables better speech recognition in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z) - Semantic Communication Enabling Robust Edge Intelligence for
Time-Critical IoT Applications [87.05763097471487]
This paper aims to design robust Edge Intelligence using semantic communication for time-critical IoT applications.
We analyze the effect of image DCT coefficients on inference accuracy and propose the channel-agnostic effectiveness encoding for offloading.
arXiv Detail & Related papers (2022-11-24T20:13:17Z) - An Intelligent Deterministic Scheduling Method for Ultra-Low Latency
Communication in Edge Enabled Industrial Internet of Things [19.277349546331557]
Time-Sensitive Networking (TSN) has recently been researched as a way to realize low-latency communication via deterministic scheduling.
A non-collision-theory-based deterministic scheduling (NDS) method is proposed to achieve ultra-low-latency communication for time-sensitive flows.
Experimental results demonstrate that NDS/DQS can effectively support deterministic ultra-low-latency services and guarantee efficient bandwidth utilization.
arXiv Detail & Related papers (2022-07-17T16:52:51Z) - Adding Connectionist Temporal Summarization into Conformer to Improve
Its Decoder Efficiency For Speech Recognition [22.61761934996406]
We propose a novel connectionist temporal summarization (CTS) method that reduces the number of frames required for the attention decoder.
With a beam width of 4, the LibriSpeech decoding budget can be reduced by up to 20%.
The word error rate (WER) is reduced by 6% relative at the beam width of 1 and by 3% relative at the beam width of 4.
arXiv Detail & Related papers (2022-04-08T07:24:00Z) - Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z) - Advanced Long-context End-to-end Speech Recognition Using
Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z) - Dissecting User-Perceived Latency of On-Device E2E Speech Recognition [34.645194215436966]
We show that factors affecting token emission latency and endpointing behavior significantly impact user-perceived latency (UPL).
We achieve the best trade-off between latency and word error rate when performing ASR jointly with endpointing, and using the recently proposed alignment regularization.
arXiv Detail & Related papers (2021-04-06T00:55:11Z) - Deep Speaker Embeddings for Far-Field Speaker Recognition on Short
Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z) - Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention in the encoder and triggered attention for the encoder-decoder attention mechanism; a minimal sketch of the time-restricted attention mask appears after this list.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
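As a rough illustration of the time-restricted self-attention mentioned in the streaming-transformer entry above, the sketch below builds a banded attention mask so that each frame attends only to a fixed window of past and future frames. The function names and window sizes are illustrative assumptions, not values from that paper.

```python
import torch


def time_restricted_mask(seq_len: int, left_context: int, right_context: int) -> torch.Tensor:
    """True where attention is allowed: key index within [query - left, query + right]."""
    idx = torch.arange(seq_len)
    offset = idx[None, :] - idx[:, None]  # key index minus query index
    return (offset >= -left_context) & (offset <= right_context)


def masked_self_attention(x: torch.Tensor, left_context: int, right_context: int) -> torch.Tensor:
    # x: (batch, time, dim); single-head, no learned projections, for brevity
    scores = x @ x.transpose(1, 2) / x.shape[-1] ** 0.5
    mask = time_restricted_mask(x.shape[1], left_context, right_context).to(x.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ x
```

In a full encoder this mask would be applied inside each multi-head attention layer; the right-context window directly controls the algorithmic lookahead, and hence the streaming latency, of the model.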
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.