An Improved Single Step Non-autoregressive Transformer for Automatic
Speech Recognition
- URL: http://arxiv.org/abs/2106.09885v1
- Date: Fri, 18 Jun 2021 02:58:30 GMT
- Title: An Improved Single Step Non-autoregressive Transformer for Automatic
Speech Recognition
- Authors: Ruchao Fan, Wei Chu, Peng Chang, Jing Xiao and Abeer Alwan
- Abstract summary: Non-autoregressive mechanisms can significantly decrease inference time for speech transformers.
Previous work on CTC alignment-based single step non-autoregressive transformer (CASS-NAT) has shown a large real time factor (RTF) improvement over autoregressive transformers (AT).
We propose several methods to improve the accuracy of the end-to-end CASS-NAT, followed by performance analyses.
- Score: 28.06475768075206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-autoregressive mechanisms can significantly decrease inference time for
speech transformers, especially when the single step variant is applied.
Previous work on CTC alignment-based single step non-autoregressive transformer
(CASS-NAT) has shown a large real time factor (RTF) improvement over
autoregressive transformers (AT). In this work, we propose several methods to
improve the accuracy of the end-to-end CASS-NAT, followed by performance
analyses. First, convolution augmented self-attention blocks are applied to
both the encoder and decoder modules. Second, we propose to expand the trigger
mask (acoustic boundary) for each token to increase the robustness of CTC
alignments. In addition, iterated loss functions are used to enhance the
gradient update of low-layer parameters. Without using an external language
model, the WERs of the improved CASS-NAT, when using the three methods, are
3.1%/7.2% on Librispeech test clean/other sets and the CER is 5.4% on the
Aishell1 test set, achieving a 7%~21% relative WER/CER improvement. For the
analyses, we plot attention weight distributions in the decoders to visualize
the relationships between token-level acoustic embeddings. When the acoustic
embeddings are visualized, we find that they have a similar behavior to word
embeddings, which explains why the improved CASS-NAT performs similarly to AT.
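The trigger-mask expansion described in the abstract admits a compact sketch: each token attends only to encoder frames inside its CTC-aligned acoustic span, widened by a few frames on each side to tolerate alignment errors. The span representation, boundary values, and expansion width below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def expand_trigger_masks(boundaries, num_frames, expand=2):
    """Build token-level trigger masks from CTC acoustic boundaries.

    boundaries: list of (start, end) frame spans per token, taken from a
    CTC alignment. Each span is widened by `expand` frames on both sides
    so the decoder stays robust to slightly misplaced CTC boundaries.
    Returns a (num_tokens, num_frames) boolean attention mask.
    """
    masks = np.zeros((len(boundaries), num_frames), dtype=bool)
    for i, (s, e) in enumerate(boundaries):
        s = max(0, s - expand)
        e = min(num_frames, e + expand)
        masks[i, s:e] = True
    return masks

# Example: three tokens over 10 encoder frames, widened by 1 frame
m = expand_trigger_masks([(0, 3), (3, 6), (6, 10)], num_frames=10, expand=1)
print(m.astype(int))
```

Widening the attendable window is what makes the decoder robust to noisy CTC boundaries: a token whose true acoustic extent slightly overruns its estimated span can still see the relevant frames.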
Related papers
- Kolmogorov-Arnold Transformer [72.88137795439407]
We introduce the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces MLP layers with Kolmogorov-Arnold Network (KAN) layers.
We identify three key challenges: (C1) base function choice, (C2) parameter and computation inefficiency, and (C3) weight initialization.
With these designs, KAT outperforms traditional MLP-based transformers.
arXiv Detail & Related papers (2024-09-16T17:54:51Z)
- Uncertainty Guided Adaptive Warping for Robust and Efficient Stereo Matching [77.133400999703]
Correlation-based stereo matching has achieved outstanding performance.
Current methods with a fixed model do not work uniformly well across various datasets.
This paper proposes a new perspective to dynamically calculate correlation for robust stereo matching.
arXiv Detail & Related papers (2023-07-26T09:47:37Z)
- A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition [26.79184118279807]
We present a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR.
Word embeddings in the autoregressive transformer (AT) are substituted with token-level acoustic embeddings (TAE) extracted from encoder outputs.
We find that CASS-NAT has a WER that is close to AT on various ASR tasks, while providing a 24x inference speedup.
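The token-level acoustic embedding (TAE) extraction can be sketched as pooling encoder frames within each CTC-aligned token span. Mean pooling and the toy dimensions below are illustrative assumptions, not necessarily the paper's exact mechanism.

```python
import numpy as np

def token_acoustic_embeddings(encoder_out, boundaries):
    """Pool encoder frames into one embedding per token.

    encoder_out: (num_frames, d_model) encoder output.
    boundaries: per-token (start, end) frame spans from a CTC alignment.
    Returns a (num_tokens, d_model) matrix that can stand in for word
    embeddings on the decoder side.
    """
    return np.stack([encoder_out[s:e].mean(axis=0) for s, e in boundaries])

enc = np.random.randn(10, 4)          # 10 frames, d_model = 4
tae = token_acoustic_embeddings(enc, [(0, 3), (3, 6), (6, 10)])
print(tae.shape)                      # one embedding per token
```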
arXiv Detail & Related papers (2023-04-15T18:34:29Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- TCCT: Tightly-Coupled Convolutional Transformer on Time Series Forecasting [6.393659160890665]
We propose the concept of a tightly-coupled convolutional Transformer (TCCT) and three TCCT architectures.
Our experiments on real-world datasets show that our TCCT architectures can greatly improve the performance of existing state-of-the-art Transformer models.
arXiv Detail & Related papers (2021-08-29T08:49:31Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input [54.82369261350497]
We propose a CTC-enhanced NAR transformer, which generates the target sequence by refining the predictions of the CTC module.
Experimental results show that our method outperforms all previous NAR counterparts and achieves 50x faster decoding than a strong AR baseline, with only a 0.0~0.3 absolute CER degradation on the Aishell-1 and Aishell-2 datasets.
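The CTC predictions that seed such a decoder come from collapsing the framewise argmax path: merge consecutive repeats, then drop blanks. A minimal sketch, with hypothetical token ids:

```python
def ctc_greedy_collapse(frame_ids, blank=0):
    """Collapse a framewise CTC argmax path into a token sequence:
    merge consecutive repeats, then drop blank symbols."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# 9 frames of argmax ids (0 = blank) collapse to a 3-token hypothesis
ids = ctc_greedy_collapse([0, 3, 3, 0, 5, 5, 5, 0, 3])
print(ids)  # [3, 5, 3]
```

The NAR decoder then refines this rough hypothesis in parallel instead of generating tokens one by one.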
arXiv Detail & Related papers (2020-10-28T15:00:09Z)
- CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition [29.55887842348706]
We propose a CTC alignment-based single step non-autoregressive decoder transformer (CASS-NAT) for speech recognition.
During inference, an error-based alignment sampling method is applied to the CTC output space, reducing the WER while retaining single-step parallel generation.
arXiv Detail & Related papers (2020-10-28T03:14:05Z)
- ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning [91.13797346047984]
We introduce ADAHESSIAN, a second order optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the Hessian.
We show that ADAHESSIAN achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods.
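The adaptive Hessian estimate at the core of ADAHESSIAN is typically obtained with Hutchinson's method, which needs only Hessian-vector products. Below is a minimal NumPy sketch on a toy quadratic; the full optimizer adds momentum and spatial averaging of the estimate, which are omitted here.

```python
import numpy as np

def hutchinson_diag_hessian(hvp, dim, num_samples=100, rng=None):
    """Estimate diag(H) via Hutchinson's method: E[z * (H z)] with
    Rademacher vectors z, where `hvp` is a Hessian-vector-product oracle."""
    rng = rng or np.random.default_rng(0)
    est = np.zeros(dim)
    for _ in range(num_samples):
        z = rng.integers(0, 2, size=dim) * 2.0 - 1.0  # Rademacher +/-1
        est += z * hvp(z)
    return est / num_samples

# For f(x) = 0.5 * x^T A x the Hessian is A, so hvp(z) = A @ z.
# With a diagonal A the estimator is exact for every sample.
A = np.diag([1.0, 2.0, 3.0])
diag_est = hutchinson_diag_hessian(lambda z: A @ z, dim=3)
print(diag_est)
```

In a real framework the `hvp` oracle is computed with a second backward pass through the gradient, so no explicit Hessian is ever formed.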
arXiv Detail & Related papers (2020-06-01T05:00:51Z)
- Weak-Attention Suppression For Transformer Based Speech Recognition [33.30436927415777]
We propose Weak-Attention Suppression (WAS), a method that dynamically induces sparsity in attention probabilities.
We demonstrate that WAS leads to consistent Word Error Rate (WER) improvement over strong transformer baselines.
arXiv Detail & Related papers (2020-05-18T23:49:40Z)
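The idea of dynamically inducing sparsity in attention probabilities can be sketched as thresholding each attention row and renormalizing. The threshold rule below (a fraction of the row's mean weight) is an illustrative assumption; the WAS paper derives its own dynamic threshold.

```python
import numpy as np

def weak_attention_suppress(probs, gamma=0.5):
    """Zero out attention probabilities below a dynamic per-row threshold,
    then renormalize so each row sums to 1 again.

    probs: (num_queries, num_keys) softmax outputs, rows summing to 1.
    gamma: fraction of the row mean used as the cutoff (illustrative).
    """
    thresh = gamma * probs.mean(axis=-1, keepdims=True)
    sparse = np.where(probs < thresh, 0.0, probs)
    return sparse / sparse.sum(axis=-1, keepdims=True)

# One query over four keys: the two weakest weights get suppressed
p = np.array([[0.70, 0.25, 0.03, 0.02]])
out = weak_attention_suppress(p, gamma=0.5)
print(out)
```

Suppressing near-uniform "weak" weights concentrates probability mass on the keys that matter, which is the mechanism the paper credits for its WER gains.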
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.