Focus on the present: a regularization method for the ASR source-target
attention layer
- URL: http://arxiv.org/abs/2011.01210v1
- Date: Mon, 2 Nov 2020 18:56:33 GMT
- Title: Focus on the present: a regularization method for the ASR source-target
attention layer
- Authors: Nanxin Chen, Piotr Żelasko, Jesús Villalba, Najim Dehak
- Abstract summary: This paper introduces a novel method to diagnose the source-target attention in state-of-the-art end-to-end speech recognition models.
Our method is based on the fact that both CTC and the source-target attention act on the same encoder representations.
We found that the source-target attention heads are able to predict several tokens ahead of the current one.
- Score: 45.73441417132897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a novel method to diagnose the source-target attention
in state-of-the-art end-to-end speech recognition models with joint
connectionist temporal classification (CTC) and attention training. Our method
is based on the fact that both CTC and the source-target attention act on
the same encoder representations. To understand the functionality of the
attention, CTC is applied to compute the token posteriors given the attention
outputs. We found that the source-target attention heads are able to predict
several tokens ahead of the current one. Inspired by the observation, a new
regularization method is proposed which leverages CTC to make source-target
attention more focused on the frames corresponding to the output token being
predicted by the decoder. Experiments reveal stable relative improvements of up
to 7% and 13% with the proposed regularization on TED-LIUM 2 and LibriSpeech.
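As a rough illustration of the regularization idea (this is a sketch under assumptions, not the paper's exact formulation: the shapes, the KL-divergence form, and all names below are hypothetical), one can penalize the divergence between each decoder step's attention distribution over encoder frames and the distribution of frames that the frame-level CTC posteriors assign to the token currently being predicted:

```python
import numpy as np

def attention_focus_loss(attn, ctc_post, token_ids):
    """Penalty encouraging each decoder step's source-target attention
    (attn: [U, T], rows sum to 1) to place mass on frames where the
    frame-level CTC posterior (ctc_post: [T, V]) supports the token
    being predicted (token_ids: [U])."""
    eps = 1e-8
    U, T = attn.shape
    loss = 0.0
    for u in range(U):
        # Probability each frame assigns to the current output token.
        frame_support = ctc_post[:, token_ids[u]]              # [T]
        target = frame_support / (frame_support.sum() + eps)
        # KL(target || attention): small when the attention matches
        # the frames that actually support this token.
        loss += np.sum(target * (np.log(target + eps)
                                 - np.log(attn[u] + eps)))
    return loss / U

# Toy example: 2 output tokens, 4 encoder frames, vocabulary of 3.
attn = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.1, 0.1, 0.7, 0.1]])
ctc_post = np.array([[0.8, 0.1, 0.1],
                     [0.6, 0.3, 0.1],
                     [0.1, 0.8, 0.1],
                     [0.1, 0.7, 0.2]])
tokens = np.array([0, 1])
print(attention_focus_loss(attn, ctc_post, tokens))
```

In this toy setup, attention that peaks on the CTC-supported frames for each token incurs a smaller penalty than attention shifted toward future frames, which is the "look-ahead" behavior the paper reports diagnosing.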
Related papers
- Sequential Attention Source Identification Based on Feature
Representation [88.05527934953311]
This paper proposes a sequence-to-sequence based localization framework called Temporal-sequence based Graph Attention Source Identification (TGASI) based on an inductive learning idea.
Notably, the inductive learning idea ensures that TGASI can detect sources in new scenarios without requiring additional prior knowledge.
arXiv Detail & Related papers (2023-06-28T03:00:28Z)
- BERT Meets CTC: New Formulation of End-to-End Speech Recognition with
Pre-trained Masked Language Model [40.16332045057132]
BERT-CTC is a novel formulation of end-to-end speech recognition.
It incorporates linguistic knowledge through the explicit output dependency obtained by BERT contextual embedding.
BERT-CTC improves over conventional approaches across variations in speaking styles and languages.
arXiv Detail & Related papers (2022-10-29T18:19:44Z) - ATCON: Attention Consistency for Vision Models [0.8312466807725921]
We propose an unsupervised fine-tuning method that improves the consistency of attention maps.
We show results on Grad-CAM and Integrated Gradients in an ablation study.
Those improved attention maps may help clinicians better understand vision model predictions.
arXiv Detail & Related papers (2022-10-18T09:30:20Z) - Personalization of CTC Speech Recognition Models [15.470660345766445]
We propose a novel two-way approach that first biases the encoder with attention over a list of rare long-tail and out-of-vocabulary words.
We evaluate our approach on open-source VoxPopuli and in-house medical datasets to showcase a 60% improvement in F1 score on domain-specific rare words.
arXiv Detail & Related papers (2022-10-18T01:08:21Z)
- CTC Alignments Improve Autoregressive Translation [145.90587287444976]
We argue that CTC does in fact make sense for translation if applied in a joint CTC/attention framework.
Our proposed joint CTC/attention models outperform pure-attention baselines across six benchmark translation tasks.
arXiv Detail & Related papers (2022-10-11T07:13:50Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech
Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning [141.38505371646482]
Cross-modal correlation provides an inherent supervision for video unsupervised representation learning.
This paper introduces a pretext task, Cross-Modal Attention Consistency (CMAC), for exploring the bidirectional local correspondence property.
CMAC aims to align the regional attention generated purely from the visual signal with the target attention generated under the guidance of acoustic signal.
arXiv Detail & Related papers (2021-06-13T07:41:15Z)
- More Than Just Attention: Learning Cross-Modal Attentions with
Contrastive Constraints [63.08768589044052]
We propose Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints to address this limitation.
CCR and CCS constraints supervise the training of attention models in a contrastive learning manner without requiring explicit attention annotations.
Experiments on both Flickr30k and MS-COCO datasets demonstrate that integrating these attention constraints into two state-of-the-art attention-based models improves the model performance.
arXiv Detail & Related papers (2021-05-20T08:48:10Z)
- GTC: Guided Training of CTC Towards Efficient and Accurate Scene Text
Recognition [27.38969404322089]
We propose guided training of the CTC model, in which the CTC model learns better alignments and feature representations from a more powerful attentional guidance.
With the benefit of guided training, the CTC model achieves robust and accurate predictions for both regular and irregular scene text.
To further leverage the potential of CTC decoder, a graph convolutional network (GCN) is proposed to learn the local correlations of extracted features.
arXiv Detail & Related papers (2020-02-04T13:26:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.