A low latency attention module for streaming self-supervised speech representation learning
- URL: http://arxiv.org/abs/2302.13451v2
- Date: Mon, 18 Mar 2024 01:09:44 GMT
- Title: A low latency attention module for streaming self-supervised speech representation learning
- Authors: Jianbo Ma, Siqi Pan, Deepak Chandran, Andrea Fanelli, Richard Cartwright
- Abstract summary: Self-supervised speech representation learning (SSRL) is a popular use-case for the transformer architecture.
We present an implementation of the attention module that enables training of SSRL architectures with low compute and memory requirements.
Our implementation also reduces the inference latency from 1.92 to 0.16 seconds.
- Score: 0.4288177321445912
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The transformer is a fundamental building block in deep learning, and the attention mechanism is the transformer's core component. Self-supervised speech representation learning (SSRL) represents a popular use-case for the transformer architecture. Due to transformers' acausal behavior, the use of transformers for SSRL has been predominantly focused on acausal applications. However, several media processing problems, such as speech processing, require real-time solutions. In this paper, we present an implementation of the attention module that enables training of SSRL architectures with low compute and memory requirements, while allowing real-time inference with low and fixed latency. The attention module proposed in this paper includes two components, streaming attention (SA) and low-latency streaming attention (LLSA). The SA represents our proposal for an efficient streaming SSRL implementation, while the LLSA solves the latency build-up problem of other streaming attention architectures, such as the masked acausal attention (MAA), guaranteeing a latency equal to one layer even when multiple layers are stacked. We present a comparative analysis between the vanilla attention, which we will refer to here as acausal attention (AA), the SA, and the LLSA, by training a streaming SSRL model with automatic speech recognition as the downstream task. When training on librispeech-clean-100 and testing on librispeech-test-clean, our low-latency attention module has a word error rate (WER) of 5.84%, which represents a significant improvement over the MAA (WER = 13.82%). Our implementation also reduces the inference latency from 1.92 to 0.16 seconds. The proposed low-latency module preserves many of the benefits of conventional acausal transformers, but also enables latency characteristics that make it applicable to real-time streaming applications.
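To make the latency build-up mentioned in the abstract concrete, below is a minimal sketch that is not taken from the paper: the names maa_mask and lookahead_after_layers and the right_context parameter are our own illustrative choices. It shows how a fixed per-layer right-context attention mask composes across stacked layers, so the effective look-ahead, and hence the algorithmic latency, grows linearly with depth. The paper's LLSA is designed to avoid exactly this growth; the sketch illustrates only the problem, not the proposed solution.

```python
import numpy as np

def maa_mask(n_frames: int, right_context: int) -> np.ndarray:
    """Boolean (query, key) mask in the spirit of masked acausal attention (MAA):
    query frame i may attend to key frames j with j <= i + right_context."""
    i = np.arange(n_frames)[:, None]
    j = np.arange(n_frames)[None, :]
    return j <= i + right_context

def lookahead_after_layers(mask: np.ndarray, n_layers: int) -> int:
    """Worst-case look-ahead (in frames) after stacking n_layers of identically
    masked attention. Information can hop one mask per layer, so the effective
    reachability is the boolean composition of the mask with itself."""
    reach = mask.copy()
    for _ in range(n_layers - 1):
        reach = (reach.astype(int) @ mask.astype(int)) > 0
    lookahead = [np.max(np.nonzero(row)[0]) - i for i, row in enumerate(reach)]
    return int(max(lookahead))

if __name__ == "__main__":
    mask = maa_mask(n_frames=64, right_context=2)
    for depth in (1, 2, 4, 8):
        print(depth, lookahead_after_layers(mask, depth))
    # Prints look-aheads 2, 4, 8, 16 for depths 1, 2, 4, 8: latency builds up
    # with depth under MAA-style masking, which is the behaviour the paper's
    # LLSA is designed to avoid (latency equal to one layer regardless of depth).
```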
Related papers
- FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction [11.146015814220858]
FiRST is an algorithm that reduces inference latency by using layer-specific routers to adaptively select a subset of transformer layers for each input sequence.
Our approach reveals that input adaptivity is critical: different task-specific middle layers play a crucial role in evolving hidden representations depending on the task.
arXiv Detail & Related papers (2024-10-16T12:45:35Z)
- Semi-Autoregressive Streaming ASR With Label Context [70.76222767090638]
We propose a streaming "semi-autoregressive" ASR model that incorporates the labels emitted in previous blocks as additional context.
Experiments show that our method outperforms the existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on Librispeech-100 clean/other test sets, and 19%/8% on the Switchboard(SWB)/Callhome(CH) test sets.
arXiv Detail & Related papers (2023-09-19T20:55:58Z)
- Short-Term Memory Convolutions [0.0]
We propose a novel method for minimizing inference-time latency and memory consumption, called Short-Term Memory Convolution (STMC).
Training of STMC-based models is faster and more stable, as the method is based solely on convolutional neural networks (CNNs).
In the case of speech separation, we achieved a 5-fold reduction in inference time and a 2-fold reduction in latency without affecting the output quality.
arXiv Detail & Related papers (2023-02-08T20:52:24Z)
- MAST: Multiscale Audio Spectrogram Transformers [53.06337011259031]
We present Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST).
In practice, MAST significantly outperforms AST by an average accuracy of 3.4% across 8 speech and non-speech tasks from the LAPE Benchmark.
arXiv Detail & Related papers (2022-11-02T23:34:12Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- WNARS: WFST based Non-autoregressive Streaming End-to-End Speech Recognition [59.975078145303605]
We propose a novel framework, namely WNARS, using hybrid CTC-attention AED models and weighted finite-state transducers.
On the AISHELL-1 task, our WNARS achieves a character error rate of 5.22% with 640ms latency, to the best of our knowledge, which is the state-of-the-art performance for online ASR.
arXiv Detail & Related papers (2021-04-08T07:56:03Z)
- Dissecting User-Perceived Latency of On-Device E2E Speech Recognition [34.645194215436966]
We show that factors affecting token emission latency and endpointing behavior significantly impact user-perceived latency (UPL).
We achieve the best trade-off between latency and word error rate when performing ASR jointly with endpointing, and using the recently proposed alignment regularization.
arXiv Detail & Related papers (2021-04-06T00:55:11Z)
- FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization [78.46088089185156]
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible.
Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models.
We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
arXiv Detail & Related papers (2020-10-21T17:05:01Z)
- Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition [8.046120977786702]
The Transformer has achieved competitive performance against state-of-the-art end-to-end models in automatic speech recognition (ASR).
The original Transformer, with encoder-decoder architecture, is only suitable for offline ASR.
We show that this architecture, named Conv-Transformer Transducer, achieves competitive performance on LibriSpeech dataset (3.6% WER on test-clean) without external language models.
arXiv Detail & Related papers (2020-08-13T08:20:02Z)
- Weak-Attention Suppression For Transformer Based Speech Recognition [33.30436927415777]
We propose Weak-Attention Suppression (WAS), a method that dynamically induces sparsity in attention probabilities.
We demonstrate that WAS leads to consistent Word Error Rate (WER) improvement over strong transformer baselines.
arXiv Detail & Related papers (2020-05-18T23:49:40Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)