SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding
- URL: http://arxiv.org/abs/2307.07421v3
- Date: Thu, 11 Jul 2024 09:20:23 GMT
- Title: SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding
- Authors: Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya
- Abstract summary: This paper proposes a novel linear-time alternative to self-attention.
It summarises an utterance with the mean over vectors for all time steps.
This single summary is then combined with time-specific information.
- Score: 17.360059094663182
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Modern speech processing systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference and training and increasing memory consumption. Cheaper alternatives to self-attention for ASR have been developed, but they fail to consistently reach the same level of accuracy. This paper, therefore, proposes a novel linear-time alternative to self-attention. It summarises an utterance with the mean over vectors for all time steps. This single summary is then combined with time-specific information. We call this method "SummaryMixing". Introducing SummaryMixing in state-of-the-art ASR models makes it feasible to preserve or exceed previous speech recognition performance while making training and inference up to 28% faster and reducing memory use by half.
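As a rough illustration of the mechanism described above, the PyTorch sketch below computes a per-frame local transform, a single mean summary over all frames, and a combination of the two. The layer shapes, activations, and the concatenation-based combiner are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SummaryMixing(nn.Module):
    """Linear-time token mixing: each frame's local transform is combined
    with one utterance-level summary, the mean over all frames.
    Minimal sketch; shapes and the combiner are illustrative assumptions."""

    def __init__(self, dim: int):
        super().__init__()
        self.local = nn.Sequential(nn.Linear(dim, dim), nn.GELU())    # time-specific path
        self.summary = nn.Sequential(nn.Linear(dim, dim), nn.GELU())  # summary path
        self.combine = nn.Linear(2 * dim, dim)                        # merges both paths

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        local = self.local(x)                                # per-frame, O(T)
        summary = self.summary(x).mean(dim=1, keepdim=True)  # one vector per utterance
        summary = summary.expand_as(local)                   # broadcast to every frame
        return self.combine(torch.cat([local, summary], dim=-1))
```

Because the summary is a single mean vector, the cost per utterance grows linearly with the number of frames, rather than quadratically as with self-attention.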
Related papers
- SSR: Alignment-Aware Modality Connector for Speech Language Models [23.859649312290447]
Fusing speech into a pre-trained language model (SpeechLM) usually suffers from inefficient encoding of long-form speech and catastrophic forgetting of the pre-trained text modality.
We propose SSR-Connector (Segmented Speech Representation Connector) for better modality fusion.
arXiv Detail & Related papers (2024-09-30T19:17:46Z)
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information (see the sketch after this entry).
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
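How such a fusion might be wired is sketched below: linguistic token states query the acoustic encoder frames through standard cross-attention. The module name, dimensions, and head count are hypothetical; the paper's exact architecture may differ.

```python
import torch.nn as nn

class AcousticLinguisticFusion(nn.Module):
    """Hypothetical cross-attention fusion: linguistic states as queries,
    acoustic frames as keys and values. Illustrative sketch only."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_states, audio_frames):
        # text_states:  (batch, n_tokens, dim) -- linguistic queries
        # audio_frames: (batch, n_frames, dim) -- acoustic keys/values
        fused, _ = self.attn(text_states, audio_frames, audio_frames)
        return fused  # per-token states for next-word / end-of-utterance prediction
```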
- Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition [15.302106458232878]
SummaryMixing is a promising linear-time complexity alternative to self-attention for non-streaming speech recognition.
This work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode.
It shows that this new linear-time speech encoder outperforms self-attention in both scenarios (a causal-summary sketch follows this entry).
arXiv Detail & Related papers (2024-09-11T10:24:43Z)
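One plausible way to make the utterance summary causal for streaming is a running mean over the frames seen so far, which stays linear-time via prefix sums. This is an illustrative assumption, not necessarily the paper's exact streaming scheme (which may, for instance, operate on chunks):

```python
import torch

def causal_mean(x: torch.Tensor) -> torch.Tensor:
    """Running mean over frames seen so far: frame t sees mean(x[:, :t+1]).
    Illustrative causal variant of the SummaryMixing mean; the paper's
    streaming scheme may differ (e.g. chunk-wise summaries).
    x: (batch, time, dim) -> (batch, time, dim)"""
    csum = x.cumsum(dim=1)                                # prefix sums, O(T)
    counts = torch.arange(1, x.size(1) + 1,
                          device=x.device, dtype=x.dtype) # 1, 2, ..., T
    return csum / counts.view(1, -1, 1)
```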
- Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition [18.50957174600796]
A common solution for automatic speech recognition (ASR) of overlapping speakers is to separate the speech and then perform ASR on the separated signals.
Currently, the separator produces artefacts which often degrade ASR performance.
This paper proposes a transcription-free method for joint training using only audio signals.
arXiv Detail & Related papers (2024-06-13T08:20:58Z)
- Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]
We propose a self-supervised continual learning approach to recognize new words.
We use a memory-enhanced Automatic Speech Recognition model from previous work.
We show that, with this approach, recognition of the new words improves as they occur more frequently.
arXiv Detail & Related papers (2024-01-09T10:39:17Z)
- ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading [65.88161811719353]
This work develops a lightweight yet effective Text-to-Speech system, ContextSpeech.
We first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding.
We construct hierarchically-structured textual semantics to broaden the scope for global context enhancement.
Experiments show that ContextSpeech significantly improves the voice quality and prosody in paragraph reading with competitive model efficiency.
arXiv Detail & Related papers (2023-07-03T06:55:03Z)
- Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning [11.50011780498048]
This paper presents a novel metric learning approach to address the performance gap between normal and silent speech in visual speech recognition (VSR).
We propose to leverage the shared literal content between normal and silent speech and present a metric learning approach based on visemes.
Our evaluation demonstrates that our method improves the accuracy of silent VSR, even when limited training data is available.
arXiv Detail & Related papers (2023-05-23T16:20:46Z)
- Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
arXiv Detail & Related papers (2022-10-27T08:10:44Z)
- Self-supervised Learning with Random-projection Quantizer for Speech Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of discrete labels produced by a random-projection quantizer (see the sketch after this entry).
It achieves similar word-error-rates as previous work using self-supervised learning with non-streaming models.
arXiv Detail & Related papers (2022-02-03T21:29:04Z)
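The sketch below shows one standard way such a quantizer produces those labels: features are passed through a frozen random projection and assigned the index of the nearest entry in a frozen random codebook. The sizes, fixed seed, and omitted feature normalisation are assumptions.

```python
import torch

class RandomProjectionQuantizer:
    """Frozen random projection plus frozen random codebook: each frame is
    projected and labelled with its nearest codebook index, giving discrete
    targets for masked prediction. Sketch; sizes and seed are assumptions."""

    def __init__(self, in_dim: int = 80, code_dim: int = 16, n_codes: int = 8192):
        g = torch.Generator().manual_seed(0)  # fixed seed: the quantizer is never trained
        self.proj = torch.randn(in_dim, code_dim, generator=g)
        self.codebook = torch.randn(n_codes, code_dim, generator=g)

    def __call__(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, in_dim) -> labels: (batch, time) of code indices
        z = feats @ self.proj
        dists = torch.cdist(z, self.codebook.expand(z.size(0), -1, -1))
        return dists.argmin(dim=-1)
```

A speech encoder is then trained to predict these indices at masked positions.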
- Directed Speech Separation for Automatic Speech Recognition of Long Form Conversational Speech [10.291482850329892]
We propose a speaker conditioned separator trained on speaker embeddings extracted directly from the mixed signal.
We achieve significant improvements in word error rate (WER) on real conversational data without the need for an additional re-stitching step.
arXiv Detail & Related papers (2021-12-10T23:07:48Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.