Scaling Up Online Speech Recognition Using ConvNets
- URL: http://arxiv.org/abs/2001.09727v1
- Date: Mon, 27 Jan 2020 12:55:02 GMT
- Title: Scaling Up Online Speech Recognition Using ConvNets
- Authors: Vineel Pratap, Qiantong Xu, Jacob Kahn, Gilad Avidov, Tatiana
Likhomanenko, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan
Collobert
- Abstract summary: We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC).
We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy.
The system has almost three times the throughput of a well tuned hybrid ASR baseline while also having lower latency and a better word error rate.
- Score: 33.75588539732141
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We design an online end-to-end speech recognition system based on Time-Depth
Separable (TDS) convolutions and Connectionist Temporal Classification (CTC).
We improve the core TDS architecture in order to limit the future context and
hence reduce latency while maintaining accuracy. The system has almost three
times the throughput of a well tuned hybrid ASR baseline while also having
lower latency and a better word error rate. Also important to the efficiency of
the recognizer is our highly optimized beam search decoder. To show the impact
of our design choices, we analyze throughput, latency, accuracy, and discuss
how these metrics can be tuned based on the user requirements.
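To make the latency-limiting idea concrete, below is a minimal PyTorch sketch of a TDS-style block whose temporal convolution is padded asymmetrically so that each output frame sees only a bounded number of future frames. The class name, layer sizes, and the future_frames parameter are illustrative assumptions; the authors' actual implementation is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowLatencyTDSBlock(nn.Module):
    """Sketch of a TDS-style block: temporal 2-D conv + pointwise MLP, both residual.
    Illustrative only; not the paper's code."""

    def __init__(self, channels: int, width: int, kernel_size: int, future_frames: int):
        super().__init__()
        assert 0 <= future_frames < kernel_size
        # Asymmetric padding: most of the receptive field covers the past;
        # only `future_frames` positions look ahead, which bounds lookahead latency.
        self.left_pad = kernel_size - 1 - future_frames
        self.right_pad = future_frames
        self.conv = nn.Conv2d(channels, channels, kernel_size=(kernel_size, 1))
        self.norm1 = nn.LayerNorm(channels * width)
        self.mlp = nn.Sequential(
            nn.Linear(channels * width, channels * width),
            nn.ReLU(),
            nn.Linear(channels * width, channels * width),
        )
        self.norm2 = nn.LayerNorm(channels * width)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, width)
        b, c, t, w = x.shape
        padded = F.pad(x, (0, 0, self.left_pad, self.right_pad))  # pad only the time axis
        y = torch.relu(self.conv(padded)) + x                     # temporal sub-block + residual
        y = y.permute(0, 2, 1, 3).reshape(b, t, c * w)            # (batch, time, channels*width)
        y = self.norm1(y)
        y = self.norm2(self.mlp(y) + y)                           # fully connected sub-block + residual
        return y.reshape(b, t, c, w).permute(0, 2, 1, 3)
```

For example, with kernel_size=11 and future_frames=2, each output frame of the block depends on at most 2 future input frames, so the per-block lookahead is bounded by 2 frames times the frame stride; stacking blocks adds these lookaheads up, which is how the future context (and hence latency) of the whole encoder is controlled.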
Related papers
- An Efficient and Streaming Audio Visual Active Speaker Detection System [2.4515389321702132]
We present two scenarios that address the key challenges posed by real-time constraints.
First, we introduce a method to limit the number of future context frames utilized by the ASD model.
Second, we propose a more stringent constraint that limits the total number of past frames the model can access during inference.
arXiv Detail & Related papers (2024-09-13T17:45:53Z) - Systematic Evaluation of Online Speaker Diarization Systems Regarding their Latency [44.99833362998488]
The latency is the time span from audio input to the output of the corresponding speaker label.
The lowest latency is achieved for the DIART-pipeline with the embedding model pyannote/embedding.
The FS-EEND system shows a similarly good latency.
arXiv Detail & Related papers (2024-07-05T06:54:27Z) - Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities enables better speech recognition in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z) - Semantic Communication Enabling Robust Edge Intelligence for
Time-Critical IoT Applications [87.05763097471487]
This paper aims to design robust Edge Intelligence using semantic communication for time-critical IoT applications.
We analyze the effect of image DCT coefficients on inference accuracy and propose the channel-agnostic effectiveness encoding for offloading.
arXiv Detail & Related papers (2022-11-24T20:13:17Z) - An Intelligent Deterministic Scheduling Method for Ultra-Low Latency
Communication in Edge Enabled Industrial Internet of Things [19.277349546331557]
Time-Sensitive Networking (TSN) has recently been researched as a way to realize low-latency communication via deterministic scheduling.
A non-collision-theory-based deterministic scheduling (NDS) method is proposed to achieve ultra-low-latency communication for time-sensitive flows.
Experimental results demonstrate that NDS/DQS can effectively support deterministic ultra-low-latency services and guarantee efficient bandwidth utilization.
arXiv Detail & Related papers (2022-07-17T16:52:51Z) - Adding Connectionist Temporal Summarization into Conformer to Improve
Its Decoder Efficiency For Speech Recognition [22.61761934996406]
We propose a novel connectionist temporal summarization (CTS) method that reduces the number of frames required for the attention decoder.
With a beam width of 4, the LibriSpeech decoding budget can be reduced by up to 20%.
The word error rate (WER) is reduced by 6% relative at the beam width of 1 and by 3% relative at the beam width of 4.
arXiv Detail & Related papers (2022-04-08T07:24:00Z) - Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z) - Advanced Long-context End-to-end Speech Recognition Using
Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z) - Dissecting User-Perceived Latency of On-Device E2E Speech Recognition [34.645194215436966]
We show that factors affecting token emission latency and endpointing behavior significantly impact user-perceived latency (UPL).
We achieve the best trade-off between latency and word error rate when performing ASR jointly with endpointing, and using the recently proposed alignment regularization.
arXiv Detail & Related papers (2021-04-06T00:55:11Z) - Deep Speaker Embeddings for Far-Field Speaker Recognition on Short
Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z) - Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention in the encoder and triggered attention for the encoder-decoder attention mechanism; a minimal sketch of the time-restricted attention mask appears after this list.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
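As a rough illustration of the time-restricted self-attention mentioned in the streaming-transformer entry above, the sketch below builds a banded attention mask so that each frame attends only to a fixed window of past and future frames. The function names and window sizes are illustrative assumptions, not values from that paper.

```python
import torch


def time_restricted_mask(seq_len: int, left_context: int, right_context: int) -> torch.Tensor:
    """True where attention is allowed: key index within [query - left, query + right]."""
    idx = torch.arange(seq_len)
    offset = idx[None, :] - idx[:, None]  # key index minus query index
    return (offset >= -left_context) & (offset <= right_context)


def masked_self_attention(x: torch.Tensor, left_context: int, right_context: int) -> torch.Tensor:
    # x: (batch, time, dim); single-head, no learned projections, for brevity
    scores = x @ x.transpose(1, 2) / x.shape[-1] ** 0.5
    mask = time_restricted_mask(x.shape[1], left_context, right_context).to(x.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ x
```

In a full encoder this mask would be applied inside each multi-head attention layer; the right-context window directly controls the algorithmic lookahead, and hence the streaming latency, of the model.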
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.