Weak-Attention Suppression For Transformer Based Speech Recognition
- URL: http://arxiv.org/abs/2005.09137v1
- Date: Mon, 18 May 2020 23:49:40 GMT
- Title: Weak-Attention Suppression For Transformer Based Speech Recognition
- Authors: Yangyang Shi, Yongqiang Wang, Chunyang Wu, Christian Fuegen, Frank
Zhang, Duc Le, Ching-Feng Yeh, Michael L. Seltzer
- Abstract summary: We propose Weak-Attention Suppression (WAS), a method that dynamically induces sparsity in attention probabilities.
We demonstrate that WAS leads to consistent Word Error Rate (WER) improvement over strong transformer baselines.
- Score: 33.30436927415777
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers, originally proposed for natural language processing (NLP)
tasks, have recently achieved great success in automatic speech recognition
(ASR). However, adjacent acoustic units (i.e., frames) are highly correlated,
and long-distance dependencies between them are weak, unlike text units. This
suggests that ASR will likely benefit from sparse and localized attention. In
this paper, we propose Weak-Attention Suppression (WAS), a method that
dynamically induces sparsity in attention probabilities. We demonstrate that
WAS leads to consistent Word Error Rate (WER) improvement over strong
transformer baselines. On the widely used LibriSpeech benchmark, our proposed
method reduced WER by 10% on test-clean and 5% on test-other for streamable
transformers, resulting in a new state-of-the-art among streaming models.
Further analysis shows that WAS learns to suppress attention over non-critical
and redundant stretches of consecutive acoustic frames, and is more likely to suppress
past frames than future ones, indicating the importance of lookahead in
attention-based ASR models.
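The abstract describes WAS only at a high level. Below is a minimal PyTorch sketch of one way to dynamically induce sparsity in attention probabilities: each query keeps only the probabilities above a dynamic per-query threshold and renormalizes the rest. The particular threshold (mean minus gamma times the per-query standard deviation) and the default gamma are one plausible instantiation for illustration, not necessarily the paper's exact formulation.

```python
import torch

def weak_attention_suppression(attn_probs: torch.Tensor,
                               gamma: float = 0.5) -> torch.Tensor:
    # attn_probs: (..., num_queries, num_keys) softmax outputs.
    # gamma is a hypothetical suppression strength; larger gamma
    # lowers the threshold and therefore keeps more probabilities.
    mean = attn_probs.mean(dim=-1, keepdim=True)
    std = attn_probs.std(dim=-1, keepdim=True)
    # Dynamic per-query threshold: anything below it counts as "weak".
    threshold = mean - gamma * std
    kept = torch.where(attn_probs >= threshold, attn_probs,
                       torch.zeros_like(attn_probs))
    # Renormalize so each query's surviving probabilities sum to 1.
    return kept / kept.sum(dim=-1, keepdim=True).clamp_min(1e-9)
```

In a transformer layer, a function like this would run on the softmax output of each head before the weighted sum over values, so downstream computation attends only to the surviving frames.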
Related papers
- Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices [8.77712061194924]
We present a finite-state transducer (FST) technique for rewriting wordpiece lattices generated by Transformer-based CTC models.
Our algorithm performs grapheme-to-phoneme (G2P) conversion directly from wordpieces into phonemes, avoiding explicit word representations.
We achieved up to a 15.2% relative reduction in sentence error rate (SER) on a test set with contextually relevant entities.
arXiv Detail & Related papers (2024-09-24T21:42:25Z)
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued token-based language modeling approach to text-to-speech synthesis (TTS).
MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition (a toy sketch follows this entry).
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
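As a loose illustration of the idea in the MELLE summary above (autoregressive generation of continuous mel-spectrogram frames from a text condition), here is a toy sketch. The GRU backbone, pooled text vector, dimensions, and stop flag are all hypothetical stand-ins; the paper's actual language model is not reproduced here.

```python
import torch
import torch.nn as nn

class ToyMelLM(nn.Module):
    """Toy autoregressive mel-spectrogram generator (illustrative only)."""

    def __init__(self, text_dim=256, mel_bins=80, hidden=256):
        super().__init__()
        self.mel_bins = mel_bins
        self.rnn = nn.GRU(text_dim + mel_bins, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, mel_bins)   # predicts the next frame
        self.to_stop = nn.Linear(hidden, 1)         # end-of-utterance flag

    @torch.no_grad()
    def generate(self, text_vec, max_frames=500):
        # text_vec: (1, text_dim) pooled text conditioning (an assumption;
        # a real system would attend over per-token text features).
        frame = torch.zeros(1, 1, self.mel_bins)    # all-zero "begin" frame
        state, outputs = None, []
        for _ in range(max_frames):
            inp = torch.cat([text_vec[:, None, :], frame], dim=-1)
            out, state = self.rnn(inp, state)
            frame = self.to_mel(out)                # continuous next frame
            outputs.append(frame)
            if torch.sigmoid(self.to_stop(out)) > 0.5:
                break                               # model decides to stop
        return torch.cat(outputs, dim=1)            # (1, T, mel_bins)
```

The key contrast with discrete-token TTS is that the loop emits real-valued frames directly, with no vector quantization step.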
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
- Promptformer: Prompted Conformer Transducer for ASR [40.88399609719793]
We introduce a novel mechanism inspired by hyper-prompting to fuse textual context with acoustic representations in the attention mechanism (a generic sketch follows this entry).
Results on a test set with multi-turn interactions show that our method achieves 5.9% relative word error rate reduction (rWERR) over a strong baseline.
arXiv Detail & Related papers (2024-01-14T20:14:35Z)
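Hyper-prompting-style fusion, as mentioned in the Promptformer summary, is commonly realized by prepending context-derived vectors to the keys and values of attention. The single-head, single-matrix formulation below is a generic sketch under that assumption, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def prompted_attention(x, text_prompt, wq, wk, wv):
    """Self-attention over acoustic frames with prepended text prompts.

    x:           (T, d) acoustic representations
    text_prompt: (P, d) embeddings derived from textual context
    wq, wk, wv:  (d, d) projection matrices (hypothetical single head)
    """
    q = x @ wq                                    # queries: acoustic only
    kv_in = torch.cat([text_prompt, x], dim=0)    # keys/values see the prompt
    k, v = kv_in @ wk, kv_in @ wv
    scores = q @ k.T / (x.shape[-1] ** 0.5)       # scaled dot-product
    return F.softmax(scores, dim=-1) @ v          # (T, d) fused output
```

Because only the keys and values are extended, the output length still matches the acoustic sequence, so a block like this drops into an existing encoder unchanged.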
- Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition [10.62060432965311]
We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR).
Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts.
arXiv Detail & Related papers (2023-10-10T09:04:33Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve performance comparable to the best reported supervised approach while using only 16% of the labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition [27.530537066239116]
We introduce the concept of relaxed attention: a gradual injection of a uniform distribution into the encoder-decoder attention weights during training (a minimal sketch follows this entry).
We find that transformers trained with relaxed attention outperform the standard baseline models consistently during decoding with external language models.
On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming the state of the art (4.20%) by 13.1% relative.
arXiv Detail & Related papers (2021-07-02T21:01:17Z)
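The relaxed-attention summary above states the mechanism directly: blend the encoder-decoder attention weights with a uniform distribution during training. A minimal sketch follows; the coefficient name eps and its fixed value are placeholders, since the summary describes a gradual injection, which suggests a schedule rather than a constant.

```python
import torch

def relaxed_attention(attn_weights: torch.Tensor,
                      eps: float = 0.1) -> torch.Tensor:
    # attn_weights: (..., num_queries, num_keys) softmax outputs of the
    # encoder-decoder attention. Apply during training only.
    num_keys = attn_weights.shape[-1]
    uniform = torch.full_like(attn_weights, 1.0 / num_keys)
    # Convex combination keeps each row a valid probability distribution.
    return (1.0 - eps) * attn_weights + eps * uniform
```

Intuitively, the uniform component discourages the decoder from over-relying on its internal language model, which is why the gains show up when decoding with an external LM.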
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning with Self-Knowledge Distillation [11.52842516726486]
We propose a Transformer-based ASR model with a time-reduction layer incorporated inside the transformer encoder layers (a generic sketch follows this entry).
We also introduce a fine-tuning approach for pre-trained ASR models using self-knowledge distillation (S-KD) which further improves the performance of our ASR model.
With language model (LM) fusion, we achieve new state-of-the-art word error rate (WER) results for Transformer-based ASR models.
arXiv Detail & Related papers (2021-03-17T21:02:36Z)
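A time-reduction layer shortens the frame sequence between encoder layers, cutting self-attention cost. The sketch below uses frame stacking (concatenating each group of neighboring frames), which is one common realization; the paper's exact operator and reduction factor are not given in the summary.

```python
import torch

def time_reduction(x: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Reduce the frame rate by stacking neighboring frames.

    x: (batch, T, d) encoder features.
    Returns: (batch, T // factor, d * factor).
    """
    b, t, d = x.shape
    t_trim = (t // factor) * factor          # drop frames that don't fit
    x = x[:, :t_trim, :]
    # Each group of `factor` consecutive frames becomes one wider frame.
    return x.reshape(b, t_trim // factor, d * factor)
```

A linear projection typically follows to map the widened dimension d * factor back to the model dimension.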
- Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end ASR.
The GRF algorithm dynamically combines the noisy and enhanced features (a minimal gating sketch follows this entry).
The proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
arXiv Detail & Related papers (2020-11-09T08:52:05Z)
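The GRF summary describes dynamically combining noisy and enhanced feature streams. A minimal gating sketch under that description follows; the paper's GRF is recurrent and jointly trained with the recognizer, neither of which is modeled here.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gate that mixes noisy and enhanced feature streams per dimension."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, noisy: torch.Tensor,
                enhanced: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) decides, per feature, how much to trust each stream.
        g = torch.sigmoid(self.gate(torch.cat([noisy, enhanced], dim=-1)))
        return g * noisy + (1.0 - g) * enhanced
```

Learning the gate jointly with the recognizer lets the model fall back to the noisy stream where enhancement introduces artifacts.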
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.