Mamba for Streaming ASR Combined with Unimodal Aggregation
- URL: http://arxiv.org/abs/2410.00070v1
- Date: Mon, 30 Sep 2024 12:11:49 GMT
- Title: Mamba for Streaming ASR Combined with Unimodal Aggregation
- Authors: Ying Fang, Xiaofei Li
- Abstract summary: Mamba, a recently proposed state space model, has demonstrated the ability to match or surpass Transformers in various tasks.
We propose an associated lookahead mechanism for leveraging controllable future information.
Experiments conducted on two Mandarin Chinese datasets demonstrate that the proposed model achieves competitive ASR performance.
- Score: 7.6112706449833505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper works on streaming automatic speech recognition (ASR). Mamba, a recently proposed state space model, has demonstrated the ability to match or surpass Transformers in various tasks while benefiting from a linear complexity advantage. We explore the efficiency of Mamba encoder for streaming ASR and propose an associated lookahead mechanism for leveraging controllable future information. Additionally, a streaming-style unimodal aggregation (UMA) method is implemented, which automatically detects token activity and streamingly triggers token output, and meanwhile aggregates feature frames for better learning token representation. Based on UMA, an early termination (ET) method is proposed to further reduce recognition latency. Experiments conducted on two Mandarin Chinese datasets demonstrate that the proposed model achieves competitive ASR performance in terms of both recognition accuracy and latency.
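The streaming-style unimodal aggregation (UMA) described in the abstract can be sketched as follows. This is an illustrative re-implementation under our own assumptions, not the authors' code: the function name, the valley-based token-boundary rule (a segment closes where the per-frame weights stop decreasing and start increasing), and the weighted-average pooling are our reading of the abstract.

```python
import numpy as np

def unimodal_aggregate(frames, weights, eps=1e-8):
    """Toy sketch of streaming unimodal aggregation (UMA).

    frames:  (T, D) encoder output frames
    weights: (T,) scalar aggregation weights in (0, 1)

    A token boundary is declared at a "valley" of the weight curve,
    i.e. where the weights stop decreasing and start increasing.
    Frames between consecutive valleys form one unimodal segment and
    are pooled into a single token representation by weighted average,
    so tokens can be emitted as the stream is processed.
    """
    tokens, seg_frames, seg_weights = [], [], []
    prev_w, decreasing = None, False

    def flush():
        # pool the current segment into one token representation
        ws = np.asarray(seg_weights)
        fs = np.asarray(seg_frames)
        tokens.append((ws[:, None] * fs).sum(0) / (ws.sum() + eps))

    for f, w in zip(frames, weights):
        if prev_w is not None and decreasing and w > prev_w and seg_frames:
            flush()                       # valley -> trigger token output
            seg_frames, seg_weights = [], []
        if prev_w is not None:
            decreasing = w < prev_w
        seg_frames.append(f)
        seg_weights.append(w)
        prev_w = w
    if seg_frames:                        # flush the final segment
        flush()
    return np.stack(tokens)
```

Under this sketch, the early-termination idea corresponds to emitting each pooled token as soon as its closing valley is detected, rather than waiting for the utterance to end.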
Related papers
- Bidirectional Gated Mamba for Sequential Recommendation [56.85338055215429]
Mamba, a recent advancement, has exhibited exceptional performance in time series prediction.
We introduce a new framework named Selective Gated Mamba (SIGMA) for Sequential Recommendation.
Our results indicate that SIGMA outperforms current models on five real-world datasets.
arXiv Detail & Related papers (2024-08-21T09:12:59Z)
- MALT: Multi-scale Action Learning Transformer for Online Action Detection [6.819772961626024]
Online action detection (OAD) aims to identify ongoing actions from streaming video in real-time, without access to future frames.
We propose a multi-scale action learning transformer (MALT) that includes a novel recurrent decoder (used for feature fusion).
We also introduce an explicit frame scoring mechanism employing sparse attention, which filters irrelevant frames more efficiently.
arXiv Detail & Related papers (2024-05-31T15:03:35Z)
- Streaming Sequence Transduction through Dynamic Compression [55.0083843520833]
We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams.
STAR dynamically segments input streams to create compressed anchor representations, achieving nearly lossless compression (12x) in Automatic Speech Recognition (ASR).
STAR demonstrates superior segmentation and latency-quality trade-offs in simultaneous speech-to-text tasks, optimizing latency, memory footprint, and quality.
arXiv Detail & Related papers (2024-02-02T06:31:50Z)
- An Investigation of Enhancing CTC Model for Triggered Attention-based Streaming ASR [19.668440671541546]
An attempt is made to combine Mask-CTC and the triggered attention mechanism to construct a streaming end-to-end automatic speech recognition (ASR) system.
The proposed method achieves higher accuracy with lower latency than the conventional triggered attention-based streaming ASR system.
arXiv Detail & Related papers (2021-10-20T06:44:58Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition [58.69803243323346]
Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks.
However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR.
We present the dual causal/non-causal self-attention architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer.
arXiv Detail & Related papers (2021-07-02T20:56:13Z)
- Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM [0.0]
We propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet.
SepConvLSTM is constructed by replacing convolution operation at each gate of ConvLSTM with a depthwise separable convolution.
Our model surpasses prior accuracy on the larger and more challenging RWF-2000 dataset by a margin of more than 2%.
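The gate substitution described above replaces a full convolution with a depthwise step followed by a pointwise (1x1) step. The following is a minimal NumPy illustration of that factorization; the function name, shapes, and looped implementation are our assumptions for clarity, not the paper's code.

```python
import numpy as np

def depthwise_separable_conv2d(x, dw_kernels, pw_kernels):
    """Sketch of the depthwise separable convolution that SepConvLSTM
    substitutes for the standard convolution at each ConvLSTM gate.

    x:          (H, W, C_in) input feature map
    dw_kernels: (k, k, C_in) one spatial filter per input channel
    pw_kernels: (C_in, C_out) 1x1 pointwise mixing weights
    """
    k = dw_kernels.shape[0]
    H, W, C = x.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    # depthwise step: each channel is convolved with its own k x k filter
    dw = np.zeros((H, W, C))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k, :]           # (k, k, C)
            dw[i, j] = (patch * dw_kernels).sum((0, 1))
    # pointwise step: a 1x1 convolution mixes information across channels
    return dw @ pw_kernels                             # (H, W, C_out)
```

The efficiency gain is in the parameter count: a full convolution needs k^2 * C_in * C_out weights, while the separable form needs only k^2 * C_in + C_in * C_out.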
arXiv Detail & Related papers (2021-02-21T12:01:48Z)
- Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling [76.43479696760996]
We propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition.
We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR.
arXiv Detail & Related papers (2020-10-12T21:12:56Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
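The time-restricted self-attention used in the streaming transformer above amounts to a banded attention mask: each frame may attend only to a bounded left and right context, and the right bound determines the encoder's algorithmic latency. A minimal sketch (parameter names are ours, not from the paper):

```python
import numpy as np

def time_restricted_mask(T, left, right):
    """Build a (T, T) boolean mask where frame t may attend only to
    frames s with t - left <= s <= t + right. Limiting `right` bounds
    the lookahead, and hence the latency, of a streaming encoder.
    """
    idx = np.arange(T)
    rel = idx[None, :] - idx[:, None]       # rel[t, s] = s - t
    return (rel >= -left) & (rel <= right)  # True = attention allowed
```

In practice such a mask is applied by setting attention scores to -inf wherever the mask is False before the softmax, so disallowed positions receive zero weight.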
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.