Focus on the present: a regularization method for the ASR source-target attention layer
- URL: http://arxiv.org/abs/2011.01210v1
- Date: Mon, 2 Nov 2020 18:56:33 GMT
- Title: Focus on the present: a regularization method for the ASR source-target attention layer
- Authors: Nanxin Chen, Piotr Żelasko, Jesús Villalba, Najim Dehak
- Abstract summary: This paper introduces a novel method to diagnose the source-target attention in state-of-the-art end-to-end speech recognition models.
Our method is based on the fact that both CTC and source-target attention act on the same encoder representations.
We found that the source-target attention heads are able to predict several tokens ahead of the current one.
- Score: 45.73441417132897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a novel method to diagnose the source-target attention
in state-of-the-art end-to-end speech recognition models with joint
connectionist temporal classification (CTC) and attention training. Our method
is based on the fact that both CTC and source-target attention act on
the same encoder representations. To understand the functionality of the
attention, CTC is applied to compute the token posteriors given the attention
outputs. We found that the source-target attention heads are able to predict
several tokens ahead of the current one. Inspired by this observation, a new
regularization method is proposed which leverages CTC to make source-target
attention more focused on the frames corresponding to the output token being
predicted by the decoder. Experiments reveal stable relative improvements of up
to 7% and 13% with the proposed regularization on TED-LIUM 2 and LibriSpeech.
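To make the proposal concrete, below is a minimal PyTorch-style sketch of what such a CTC-guided attention regularizer could look like. The function name `ctc_attention_regularizer`, the tensor shapes, and the KL form are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ctc_attention_regularizer(attn, ctc_logits, target_ids):
    # attn:       (B, U, T) source-target attention over T encoder frames,
    #             one distribution per output token.
    # ctc_logits: (B, T, V) frame-level CTC logits from the shared encoder.
    # target_ids: (B, U) token ids currently predicted by the decoder.
    ctc_post = ctc_logits.softmax(dim=-1)                     # (B, T, V)
    # Posterior of each target token at every frame -> (B, U, T).
    idx = target_ids.unsqueeze(-1).expand(-1, -1, ctc_post.size(1))
    ref = ctc_post.transpose(1, 2).gather(1, idx)
    ref = ref / ref.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    # KL(reference || attention): penalize attention mass on frames where
    # CTC does not see the token being predicted.
    return F.kl_div(attn.clamp_min(1e-8).log(), ref, reduction="batchmean")
```

In a joint CTC/attention system, this term would simply be added to the usual training objective, e.g. `loss = ce + ctc + lam * reg`.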
Related papers
- ResCLIP: Residual Attention for Training-free Dense Vision-language Inference [27.551367463011008]
Cross-correlation of self-attention in CLIP's non-final layers also exhibits localization properties.
We propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block.
The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference.
arXiv Detail & Related papers (2024-11-24T14:14:14Z)
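As a rough illustration of the RCS idea, the sketch below residually mixes intermediate-layer self-attention maps into the final block's attention before it is applied to the values. The mean aggregation and the weight `alpha` are assumptions, not the official module definition.

```python
import torch

def residual_cross_correlation_attention(final_attn, mid_attns, values, alpha=0.5):
    # final_attn: (B, H, N, N) attention map of the final block.
    # mid_attns:  list of (B, H, N, N) self-attention maps from intermediate
    #             (non-final) layers, which ResCLIP observes to localize better.
    # values:     (B, H, N, D) value vectors of the final block.
    mid = torch.stack(mid_attns).mean(dim=0)            # aggregate intermediate maps
    remolded = (1 - alpha) * final_attn + alpha * mid   # residual combination
    remolded = remolded / remolded.sum(-1, keepdim=True).clamp_min(1e-8)
    return remolded @ values                            # (B, H, N, D)
```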
- Recycled Attention: Efficient inference for long-context language models [54.00118604124301]
We propose Recycled Attention, an inference-time method which alternates between full context attention and attention over a subset of input tokens.
When performing partial attention, we recycle the attention pattern of a previous token that has performed full attention and attend only to the top K most attended tokens.
Compared to previously proposed inference-time acceleration methods, which attend only to the local context or to tokens with high accumulated attention scores, our approach flexibly chooses tokens that are relevant to the current decoding step.
arXiv Detail & Related papers (2024-11-08T18:57:07Z)
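A minimal sketch of the alternating schedule described above; the function names, tensor shapes, and the `stride`/`top_k` defaults are assumptions, not the paper's configuration.

```python
import torch

def attend(q, k, v):
    # Scaled dot-product attention for one decoding step; q is (B, H, 1, D).
    scores = (q @ k.transpose(-2, -1)) / k.size(-1) ** 0.5   # (B, H, 1, T)
    probs = scores.softmax(dim=-1)
    return probs @ v, probs

def recycled_attention(q, k, v, step, state, stride=4, top_k=256):
    # `state` is a plain dict persisting the recycled index across steps.
    if step % stride == 0 or state.get("idx") is None:
        out, probs = attend(q, k, v)                       # full-context attention
        scores = probs.mean(dim=1).squeeze(1)              # (B, T) head-averaged
        k_eff = min(top_k, scores.size(-1))
        state["idx"] = scores.topk(k_eff, dim=-1).indices  # remember top-K positions
        return out
    idx = state["idx"].unsqueeze(1).unsqueeze(-1)          # (B, 1, K, 1)
    idx = idx.expand(-1, k.size(1), -1, k.size(-1))        # (B, H, K, D)
    out, _ = attend(q, k.gather(2, idx), v.gather(2, idx)) # partial attention
    return out
```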
- CR-CTC: Consistency regularization on CTC for improved speech recognition [18.996929774821822]
Connectionist Temporal Classification (CTC) is a widely used method for automatic speech recognition (ASR).
However, it often falls short in recognition performance compared to transducer models or systems combining CTC with an attention-based encoder-decoder (CTC/AED).
We propose the Consistency-Regularized CTC (CR-CTC), which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram.
arXiv Detail & Related papers (2024-10-07T14:56:07Z)
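A minimal sketch of a CR-CTC-style objective, assuming both augmented views keep the same time length (as with SpecAugment) and using a symmetric KL as the consistency term; the weight `alpha` is a made-up hyperparameter.

```python
import torch.nn.functional as F

def cr_ctc_loss(log_probs_a, log_probs_b, targets, in_lens, tgt_lens, alpha=0.2):
    # log_probs_a/b: (T, B, V) log-softmax outputs for two augmented views
    # of the same input mel-spectrogram.
    ctc = (F.ctc_loss(log_probs_a, targets, in_lens, tgt_lens)
           + F.ctc_loss(log_probs_b, targets, in_lens, tgt_lens))
    # Symmetric KL pulls the two frame-level distributions together.
    consistency = (F.kl_div(log_probs_a, log_probs_b.exp(), reduction="batchmean")
                   + F.kl_div(log_probs_b, log_probs_a.exp(), reduction="batchmean"))
    return ctc + alpha * consistency
```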
- Sequential Attention Source Identification Based on Feature Representation [88.05527934953311]
This paper proposes a sequence-to-sequence localization framework, Temporal-sequence based Graph Attention Source Identification (TGASI), built on an inductive learning idea.
Notably, the inductive learning design enables TGASI to detect sources in new scenarios without requiring additional prior knowledge.
arXiv Detail & Related papers (2023-06-28T03:00:28Z)
- ATCON: Attention Consistency for Vision Models [0.8312466807725921]
We propose an unsupervised fine-tuning method that improves the consistency of attention maps.
We show results on Grad-CAM and Integrated Gradients in an ablation study.
Those improved attention maps may help clinicians better understand vision model predictions.
arXiv Detail & Related papers (2022-10-18T09:30:20Z)
- Personalization of CTC Speech Recognition Models [15.470660345766445]
We propose a novel two-way approach that first biases the encoder with attention over a list of rare long-tail and out-of-vocabulary words.
We evaluate our approach on open-source VoxPopuli and in-house medical datasets to showcase a 60% improvement in F1 score on domain-specific rare words.
arXiv Detail & Related papers (2022-10-18T01:08:21Z)
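One plausible shape for the encoder-biasing stage, sketched with a standard cross-attention layer over the bias-word embeddings; the class name and the residual injection are assumptions, not the paper's architecture.

```python
import torch.nn as nn

class BiasAttention(nn.Module):
    # Hedged sketch: encoder states attend over a list of rare/OOV word
    # embeddings and the resulting context is injected residually.
    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, enc, bias_emb):
        # enc:      (B, T, D) encoder states
        # bias_emb: (B, W, D) embeddings of the rare-word bias list
        ctx, _ = self.attn(query=enc, key=bias_emb, value=bias_emb)
        return enc + ctx  # bias the encoder toward the listed words
```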
- CTC Alignments Improve Autoregressive Translation [145.90587287444976]
We argue that CTC does in fact make sense for translation if applied in a joint CTC/attention framework.
Our proposed joint CTC/attention models outperform pure-attention baselines across six benchmark translation tasks.
arXiv Detail & Related papers (2022-10-11T07:13:50Z)
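For reference, a joint CTC/attention objective typically combines a frame-level CTC loss with the decoder's cross-entropy. A minimal sketch, where the interpolation weight `lam` is an assumed, task-dependent hyperparameter:

```python
import torch.nn.functional as F

def joint_ctc_attention_loss(ctc_log_probs, dec_logits, targets,
                             in_lens, tgt_lens, lam=0.3):
    # ctc_log_probs: (T, B, V) log-softmax outputs of the CTC branch.
    # dec_logits:    (B, U, V) autoregressive decoder logits.
    # targets:       (B, U) reference token ids (reused as padded CTC
    #                targets here for brevity).
    ctc = F.ctc_loss(ctc_log_probs, targets, in_lens, tgt_lens)
    ce = F.cross_entropy(dec_logits.transpose(1, 2), targets)  # (B, V, U) vs (B, U)
    return lam * ctc + (1 - lam) * ce
```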
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning [141.38505371646482]
Cross-modal correlation provides inherent supervision for unsupervised video representation learning.
This paper introduces a pretext task, Cross-Modal Attention Consistency (CMAC), for exploring the bidirectional local correspondence property.
CMAC aims to align the regional attention generated purely from the visual signal with the target attention generated under the guidance of the acoustic signal.
arXiv Detail & Related papers (2021-06-13T07:41:15Z)
- GTC: Guided Training of CTC Towards Efficient and Accurate Scene Text Recognition [27.38969404322089]
We propose guided training of the CTC model, where the CTC model learns better alignments and feature representations from a more powerful attentional guidance.
With the benefit of guided training, the CTC model achieves robust and accurate predictions for both regular and irregular scene text.
To further leverage the potential of the CTC decoder, a graph convolutional network (GCN) is proposed to learn the local correlations of extracted features.
arXiv Detail & Related papers (2020-02-04T13:26:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.