Developing Real-time Streaming Transformer Transducer for Speech
Recognition on Large-scale Dataset
- URL: http://arxiv.org/abs/2010.11395v3
- Date: Sun, 28 Feb 2021 08:02:29 GMT
- Title: Developing Real-time Streaming Transformer Transducer for Speech
Recognition on Large-scale Dataset
- Authors: Xie Chen, Yu Wu, Zhenghao Wang, Shujie Liu, Jinyu Li
- Abstract summary: We explore Transformer Transducer (T-T) models for first-pass decoding with low latency and fast speed on a large-scale dataset.
We combine the idea of Transformer-XL and chunk-wise streaming processing to design a streamable Transformer Transducer model.
We demonstrate that T-T outperforms the hybrid model, RNN Transducer (RNN-T), and streamable Transformer attention-based encoder-decoder model in the streaming scenario.
- Score: 37.619200507404145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Transformer based end-to-end models have achieved great success in
many areas including speech recognition. However, compared to LSTM models, the
heavy computational cost of the Transformer during inference is a key issue
preventing its application. In this work, we explored the potential of
Transformer Transducer (T-T) models for first-pass decoding with low latency
and fast speed on a large-scale dataset. We combine the idea of Transformer-XL
and chunk-wise streaming processing to design a streamable Transformer
Transducer model. We demonstrate that T-T outperforms the hybrid model, RNN
Transducer (RNN-T), and streamable Transformer attention-based encoder-decoder
model in the streaming scenario. Furthermore, the runtime cost and latency can
be optimized with a relatively small look-ahead.
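
The chunk-wise streaming idea can be made concrete as an attention mask: each frame attends to every frame in its own chunk (a bounded look-ahead up to the chunk boundary) plus a limited number of cached history chunks, in the spirit of Transformer-XL. Below is a minimal PyTorch sketch; the chunk size and history length are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def chunkwise_streaming_mask(seq_len: int, chunk_size: int, left_chunks: int) -> torch.Tensor:
    """Boolean attention mask for chunk-wise streaming self-attention.

    A frame may attend to every frame in its own chunk (a small look-ahead
    up to the chunk boundary) and to `left_chunks` preceding chunks of
    history, mimicking Transformer-XL style cached context. Illustrative
    sketch only; sizes are assumptions, not the paper's settings.
    """
    chunk_id = torch.arange(seq_len) // chunk_size    # chunk index of each frame
    q = chunk_id.unsqueeze(1)                         # query chunk ids, (T, 1)
    k = chunk_id.unsqueeze(0)                         # key chunk ids,   (1, T)
    # attend iff the key's chunk lies in [query_chunk - left_chunks, query_chunk]
    return (k <= q) & (k >= q - left_chunks)          # (T, T), True = may attend

# Example: 8 frames, chunks of 2 frames, one chunk of cached history
print(chunkwise_streaming_mask(8, 2, 1).int())
```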
Related papers
- PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
The self-attention mechanism in the Transformer architecture requires positional embeddings to encode temporal order in time series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
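
For context, the positional-embedding reliance being questioned is typified by the sinusoidal encoding of the original Transformer, sketched below; this is standard background, not the paper's PRE module.

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding (Vaswani et al., 2017).

    Self-attention is permutation-invariant, so without an encoding of
    position the model cannot tell time step t from time step t+1; this
    is the reliance the paper argues against. d_model is assumed even.
    """
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (T, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-torch.log(torch.tensor(10000.0)) / d_model)
    )                                                                   # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

print(sinusoidal_positional_encoding(4, 8).shape)  # torch.Size([4, 8])
```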
arXiv Detail & Related papers (2024-08-20T01:56:07Z)
- TransAxx: Efficient Transformers with Approximate Computing [4.347898144642257]
Vision Transformer (ViT) models have been shown to be very competitive and have become a popular alternative to Convolutional Neural Networks (CNNs).
We propose TransAxx, a framework based on the popular PyTorch library that enables fast inherent support for approximate arithmetic.
Our approach uses a Monte Carlo Tree Search (MCTS) algorithm to efficiently search the space of possible configurations.
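
To illustrate the search component, here is a toy MCTS sketch over per-layer approximation levels with a made-up reward; LAYERS, LEVELS, and reward() are illustrative assumptions, not TransAxx's actual objective or implementation.

```python
import math
import random

# Toy stand-in for TransAxx-style configuration search: choose an
# approximation level per layer to maximize a hypothetical reward.
LAYERS, LEVELS = 4, 3

def reward(config):
    # Hypothetical proxy: cheaper arithmetic (higher level) helps, but
    # aggressive approximation in the first layer is penalized.
    return sum(l - 0.8 * l * (i == 0) for i, l in enumerate(config)) / LAYERS

def mcts(iters=2000, c=1.4):
    N, W = {(): 0}, {(): 0.0}          # visit counts and total reward per node
    for _ in range(iters):
        node, path = (), [()]
        while len(node) < LAYERS:      # selection / expansion
            kids = [node + (l,) for l in range(LEVELS)]
            unexpanded = [k for k in kids if k not in N]
            if unexpanded:
                node = random.choice(unexpanded)
                N[node], W[node] = 0, 0.0
                path.append(node)
                break
            # UCB1: exploit high mean reward, explore rarely-visited children
            node = max(kids, key=lambda k: W[k] / N[k]
                       + c * math.sqrt(math.log(N[node]) / N[k]))
            path.append(node)
        # Rollout: finish the configuration at random, then score it.
        full = node + tuple(random.randrange(LEVELS) for _ in range(LAYERS - len(node)))
        r = reward(full)
        for n in path:                 # backpropagation
            N[n] += 1
            W[n] += r
    best = ()
    while len(best) < LAYERS:          # extract the most-visited path
        best = max((best + (l,) for l in range(LEVELS)), key=lambda k: N.get(k, 0))
    return best

print(mcts())
```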
arXiv Detail & Related papers (2024-02-12T10:16:05Z)
- Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator [24.690247474891958]
The Fourier Transformer is able to significantly reduce computational costs while retaining the ability to inherit from various large pretrained models.
Our model achieves state-of-the-art performances among all transformer-based models on the long-range modeling benchmark LRA.
For generative seq-to-seq tasks including CNN/DailyMail and ELI5, by inheriting the BART weights our model outperforms the standard BART.
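
One way to picture FFT-based redundancy removal is to keep only the low-frequency part of the hidden sequence and invert it onto a shorter time axis, as in the sketch below; the paper's exact operator and normalization may differ.

```python
import torch

def fft_downsample(x: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Shorten a hidden sequence by dropping its high-frequency content.

    x: (batch, seq_len, d_model) -> (batch, ~seq_len * keep_ratio, d_model).
    A sketch of FFT-based redundancy removal; the paper's exact operator
    and normalization may differ.
    """
    T = x.shape[1]
    freq = torch.fft.rfft(x, dim=1)              # (B, T//2 + 1, D) spectrum
    k = max(1, int((T // 2 + 1) * keep_ratio))   # low-frequency bins to keep
    new_T = max(1, int(T * keep_ratio))
    # Inverse transform onto a shorter time axis; rescale to keep magnitude.
    return torch.fft.irfft(freq[:, :k, :], n=new_T, dim=1) * (new_T / T)

x = torch.randn(2, 128, 64)
print(fft_downsample(x).shape)                   # torch.Size([2, 64, 64])
```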
arXiv Detail & Related papers (2023-05-24T12:33:06Z)
- Error Correction Code Transformer [92.10654749898927]
We propose to extend for the first time the Transformer architecture to the soft decoding of linear codes at arbitrary block lengths.
We encode each channel output into a high dimension for a better representation of the bit information to be processed separately.
The proposed approach demonstrates the extreme power and flexibility of Transformers and outperforms existing state-of-the-art neural decoders by large margins at a fraction of their time complexity.
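
The high-dimensional bit encoding can be sketched as one token per received soft bit: a linear projection lifts each scalar channel output to a d_model vector for a Transformer to attend over. The projection below is an illustrative assumption, not the paper's exact embedding.

```python
import torch
import torch.nn as nn

class BitEmbedding(nn.Module):
    """Lift each received soft bit (e.g. an LLR) into a high-dimensional token.

    A sketch of the one-token-per-bit idea: an n-bit channel output of
    shape (batch, n) becomes a (batch, n, d_model) sequence a Transformer
    can attend over. The linear projection is an illustrative assumption.
    """
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.proj = nn.Linear(1, d_model)   # scalar channel output -> d_model vector

    def forward(self, llr: torch.Tensor) -> torch.Tensor:
        # llr: (batch, n) soft channel outputs, one scalar per code bit
        return self.proj(llr.unsqueeze(-1))

tokens = BitEmbedding()(torch.randn(8, 31))  # e.g. a length-31 block code
print(tokens.shape)                          # torch.Size([8, 31, 64])
```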
arXiv Detail & Related papers (2022-03-27T15:25:58Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
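
The Toeplitz observation is easy to verify in code: because RPE biases depend only on i - j, the bias matrix embeds in a circulant, reducing the matrix-vector product to FFTs in O(n log n). A minimal sketch with a correctness check:

```python
import torch

def toeplitz_matvec_fft(col: torch.Tensor, row: torch.Tensor,
                        x: torch.Tensor) -> torch.Tensor:
    """Multiply an n x n Toeplitz matrix by a vector in O(n log n).

    col = T[:, 0], row = T[0, :] with row[0] == col[0]. Embedding T in a
    size-2n circulant turns the product into a circular convolution,
    which FFTs evaluate fast: the idea the paper exploits for RPE.
    """
    n = x.shape[0]
    v = torch.cat([col, torch.zeros(1), row[1:].flip(0)])   # circulant first column
    x_pad = torch.cat([x, torch.zeros(n)])
    y = torch.fft.ifft(torch.fft.fft(v) * torch.fft.fft(x_pad))
    return y.real[:n]

# Sanity check against the dense Toeplitz product
n = 8
col, row, x = torch.randn(n), torch.randn(n), torch.randn(n)
row[0] = col[0]
d = torch.arange(n).unsqueeze(1) - torch.arange(n).unsqueeze(0)   # i - j
T = torch.where(d >= 0, col[d.clamp(min=0)], row[(-d).clamp(min=0)])
print(torch.allclose(T @ x, toeplitz_matvec_fft(col, row, x), atol=1e-5))
```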
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- Spatiotemporal Transformer for Video-based Person Re-identification [102.58619642363958]
We show that, despite the strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting.
We propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains.
The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks.
arXiv Detail & Related papers (2021-03-30T16:19:27Z)
- Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition [8.046120977786702]
The Transformer has achieved competitive performance against state-of-the-art end-to-end models in automatic speech recognition (ASR).
The original Transformer, with encoder-decoder architecture, is only suitable for offline ASR.
We show that this architecture, named Conv-Transformer Transducer, achieves competitive performance on LibriSpeech dataset (3.6% WER on test-clean) without external language models.
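
A common way to obtain the low frame rate is strided convolution ahead of the self-attention layers; the sketch below shows a generic 4x subsampling frontend and is not the paper's exact interleaved architecture.

```python
import torch
import torch.nn as nn

class ConvSubsampler(nn.Module):
    """Strided convolutions that cut the acoustic frame rate before
    self-attention. A generic sketch of the frame-rate reduction idea;
    the paper interleaves such convolutions between Transformer stages.
    """
    def __init__(self, feat_dim: int = 80, d_model: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )  # two stride-2 convs: 4x fewer frames for self-attention to process

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) filterbank features
        x = self.conv(feats.transpose(1, 2))   # (B, d_model, time/4)
        return x.transpose(1, 2)               # (B, time/4, d_model)

out = ConvSubsampler()(torch.randn(2, 100, 80))
print(out.shape)                                # torch.Size([2, 25, 256])
```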
arXiv Detail & Related papers (2020-08-13T08:20:02Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
- Exploring Transformers for Large-Scale Speech Recognition [34.645597506707055]
We show that Transformers can achieve around 6% relative word error rate (WER) reduction compared to the BLSTM baseline in the offline fashion.
In the streaming fashion, Transformer-XL is comparable to LC-BLSTM with 800 millisecond latency constraint.
arXiv Detail & Related papers (2020-05-19T18:07:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.