Related papers: Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition

URL: http://arxiv.org/abs/2409.07165v1
Date: Wed, 11 Sep 2024 10:24:43 GMT
Title: Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition
Authors: Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya
Abstract summary: SummaryMixing is a promising linear-time complexity alternative to self-attention for non-streaming speech recognition. This work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. It shows that this new linear-time complexity speech encoder outperforms self-attention in both scenarios.
Score: 15.302106458232878
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Automatic speech recognition (ASR) with an encoder equipped with self-attention, whether streaming or non-streaming, takes quadratic time in the length of the speech utterance. This slows down training and decoding, increases their cost, and limits the deployment of ASR on constrained devices. SummaryMixing is a promising linear-time complexity alternative to self-attention for non-streaming speech recognition that, for the first time, preserves or outperforms the accuracy of self-attention models. Unfortunately, the original definition of SummaryMixing is not suited to streaming speech recognition. Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. It shows that this new linear-time complexity speech encoder outperforms self-attention in both scenarios while requiring less compute and memory during training and decoding.
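
The linear-time idea can be illustrated with a minimal sketch. Following the description in the original SummaryMixing paper listed among the related papers on this page, an utterance is summarised by the mean over the vectors of all time steps, and that single summary is combined with time-specific information. The function names and the `local_fn`, `summary_fn`, and `combine_fn` parameters below are hypothetical stand-ins for learned transformations. The causal variant, which averages only the frames seen so far, is an assumption made here to show why a full-utterance mean does not suit streaming; it is not necessarily the scheme used in this work.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_vectors(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def summary_mixing(
    frames: Sequence[Vector],
    local_fn: Callable[[Vector], Vector],
    summary_fn: Callable[[Vector], Vector],
    combine_fn: Callable[[Vector, Vector], Vector],
) -> List[Vector]:
    """Offline sketch: one mean over all frames, so the cost is linear in len(frames)."""
    summary = mean_vectors([summary_fn(x) for x in frames])
    return [combine_fn(local_fn(x), summary) for x in frames]


def causal_summary_mixing(
    frames: Sequence[Vector],
    local_fn: Callable[[Vector], Vector],
    summary_fn: Callable[[Vector], Vector],
    combine_fn: Callable[[Vector, Vector], Vector],
) -> List[Vector]:
    """Hypothetical streaming variant: running mean over the frames seen so far."""
    running_sum: Vector = []
    outputs: List[Vector] = []
    for t, x in enumerate(frames, start=1):
        s = summary_fn(x)
        if not running_sum:
            running_sum = list(s)
        else:
            running_sum = [a + b for a, b in zip(running_sum, s)]
        summary = [value / t for value in running_sum]
        outputs.append(combine_fn(local_fn(x), summary))
    return outputs


def identity(x: Vector) -> Vector:
    return list(x)


def concat(a: Vector, b: Vector) -> Vector:
    return a + b


frames = [[1.0, 2.0], [3.0, 4.0]]
offline = summary_mixing(frames, identity, identity, concat)
streaming = causal_summary_mixing(frames, identity, identity, concat)
```

In this toy run the first frame differs between the two modes, because the offline summary already includes the second frame, while the last frame matches, because by then the running mean has seen the whole utterance.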

Related papers

Polynomial Mixing for Efficient Self-supervised Speech Encoders [50.58463928808225]
Polynomial Mixer (PoM) is a drop-in replacement for multi-head self-attention. PoM achieves its performance on downstream speech recognition tasks.
arXiv Detail & Related papers (2026-02-28T14:45:55Z)
Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications [0.8691520242484038]
Full-attention Transformer encoders remain a strong accuracy baseline for automatic speech recognition (ASR). We introduce Moonshine v2, an ergodic streaming-encoder ASR model that employs sliding-window self-attention to achieve bounded, low-latency inference. Our models achieve state-of-the-art word error rates across standard benchmarks, attaining accuracy on par with models 6x their size while running significantly faster.
arXiv Detail & Related papers (2026-02-12T18:20:45Z)
Voxtral Realtime [134.66962524291424]
Voxtral Realtime is a streaming automatic speech recognition model. It matches offline transcription quality at sub-second latency. We release the model weights under the Apache 2.0 license.
arXiv Detail & Related papers (2026-02-11T19:17:10Z)
Decoder-only Architecture for Streaming End-to-end Speech Recognition [45.161909551392085]
We propose to use a decoder-only architecture for blockwise streaming automatic speech recognition (ASR). In our approach, speech features are compressed using CTC output and context embedding using blockwise speech subnetwork, and are sequentially provided as prompts to the decoder. Our proposed decoder-only streaming ASR achieves 8% relative word error rate reduction in the LibriSpeech test-other set while being twice as fast as the baseline model.
arXiv Detail & Related papers (2024-06-23T13:50:08Z)
Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition [15.610658840718607]
We propose a mixture encoder to mitigate the effect of artifacts introduced by the speech separation. We extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps. Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder.
arXiv Detail & Related papers (2023-09-15T14:57:28Z)
SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding [17.360059094663182]
This paper proposes a novel linear-time alternative to self-attention. It summarises an utterance with the mean over vectors for all time steps. This single summary is then combined with time-specific information.
arXiv Detail & Related papers (2023-07-12T12:51:23Z)
Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what". Our model is based on token-level serialized output training (t-SOT), which was recently proposed to transcribe multi-talker speech in a streaming fashion. The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing. We propose a novel end-to-end streaming NAR speech recognition system. We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition [58.69803243323346]
Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks. However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR. We present the dual causal/non-causal self-attention architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer.
arXiv Detail & Related papers (2021-07-02T20:56:13Z)
Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification. We validate our idea on the LibrispeechMix dataset -- a multi-talker dataset derived from Librispeech -- and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition [60.462770498366524]
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user. We show that such a model can be quantized as an 8-bit integer model and run in real time.
arXiv Detail & Related papers (2020-09-09T14:26:56Z)
Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end ASR system for streaming ASR. We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism. Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.