Capturing Multi-Resolution Context by Dilated Self-Attention
- URL: http://arxiv.org/abs/2104.02858v1
- Date: Wed, 7 Apr 2021 02:04:18 GMT
- Title: Capturing Multi-Resolution Context by Dilated Self-Attention
- Authors: Niko Moritz, Takaaki Hori, Jonathan Le Roux
- Abstract summary: We propose a combination of restricted self-attention and a dilation mechanism, which we refer to as dilated self-attention.
The restricted self-attention allows attention to neighboring frames of the query at a high resolution, and the dilation mechanism summarizes distant information to allow attending to it with a lower resolution.
ASR results demonstrate substantial improvements over restricted self-attention alone, achieving results similar to full-sequence self-attention at a fraction of the computational cost.
- Score: 58.69803243323346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-attention has become an important and widely used neural network
component that helped to establish new state-of-the-art results for various
applications, such as machine translation and automatic speech recognition
(ASR). However, the computational complexity of self-attention grows
quadratically with the input sequence length. This can be particularly
problematic for applications such as ASR, where an input sequence generated
from an utterance can be relatively long. In this work, we propose a
combination of restricted self-attention and a dilation mechanism, which we
refer to as dilated self-attention. The restricted self-attention allows
attention to neighboring frames of the query at a high resolution, and the
dilation mechanism summarizes distant information to allow attending to it with
a lower resolution. Different methods for summarizing distant frames are
studied, such as subsampling, mean-pooling, and attention-based pooling. ASR
results demonstrate substantial improvements compared to restricted
self-attention alone, achieving similar results compared to full-sequence based
self-attention with a fraction of the computational costs.
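To make the idea in the abstract concrete, below is a minimal NumPy sketch of single-head dilated self-attention that combines a restricted local window with mean-pooled summaries of distant frames (one of the summarization methods mentioned above). The function name, window half-width `w`, pooling factor `pool`, and the omission of learned query/key/value projections are simplifications for illustration, not the authors' exact configuration.

```python
# Minimal sketch of dilated self-attention with mean-pooled distant context.
# Illustrative only: no learned projections, single head, arbitrary defaults.
import numpy as np

def dilated_self_attention(x, w=16, pool=8):
    """x: (T, d) frame sequence; w: half-width of the restricted window;
    pool: pooling factor used to summarize distant frames."""
    T, d = x.shape
    # Low-resolution context: mean-pool the sequence into T // pool summary frames.
    n_sum = T // pool
    summaries = x[:n_sum * pool].reshape(n_sum, pool, d).mean(axis=1)

    out = np.zeros_like(x)
    for t in range(T):
        # High-resolution context: raw frames within +/- w of the query frame.
        lo, hi = max(0, t - w), min(T, t + w + 1)
        local = x[lo:hi]
        # Keep only summaries whose center falls outside the local window,
        # so distant information is attended to at the lower resolution only.
        centers = np.arange(n_sum) * pool + pool // 2
        distant = summaries[np.abs(centers - t) > w]
        keys = np.concatenate([local, distant], axis=0)

        # Standard scaled dot-product attention over the combined context
        # (query/key/value projections omitted to keep the sketch short).
        scores = keys @ x[t] / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[t] = weights @ keys
    return out

# Example: a 200-frame sequence of 64-dimensional acoustic features.
rng = np.random.default_rng(0)
y = dilated_self_attention(rng.standard_normal((200, 64)))
print(y.shape)  # (200, 64)
```

In this sketch, each query attends to roughly 2w + T/pool positions instead of T, so the quadratic term in the per-layer cost is reduced by the pooling factor, which illustrates the savings the abstract refers to.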
Related papers
- Complex-Valued Time-Frequency Self-Attention for Speech Dereverberation [39.64103126881576]
We propose a complex-valued T-F attention (TFA) module that models spectral and temporal dependencies.
We validate the effectiveness of our proposed complex-valued TFA module with the deep complex convolutional recurrent network (DCCRN) using the REVERB challenge corpus.
Experimental findings indicate that integrating our complex-TFA module with DCCRN improves overall speech quality and performance of back-end speech applications.
arXiv Detail & Related papers (2022-11-22T23:38:10Z)
- Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition [32.45255303465946]
We introduce sparse attention and monotonic attention into Transformer-based ASR.
The experiments show that our method can effectively improve the attention mechanism on widely used benchmarks of speech recognition.
arXiv Detail & Related papers (2022-09-30T01:55:57Z)
- SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition [49.42625022146008]
We present the advantages of applying SRU++ in ASR tasks by comparing with Conformer across multiple ASR benchmarks.
Specifically, our analysis shows that SRU++ can surpass Conformer on long-form speech input by a large margin.
arXiv Detail & Related papers (2021-10-11T19:23:50Z)
- Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding [53.377028000325424]
We propose an Iterative Alignment Network (IA-Net) for temporal sentence grounding task.
We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs.
We also devise a calibration module following each attention module to refine the alignment knowledge.
arXiv Detail & Related papers (2021-09-14T02:08:23Z)
- Adaptive Multi-Resolution Attention with Linear Complexity [18.64163036371161]
We propose a novel structure named Adaptive Multi-Resolution Attention (AdaMRA for short).
We leverage a multi-resolution multi-head attention mechanism, enabling attention heads to capture long-range contextual information in a coarse-to-fine fashion.
To facilitate AdaMRA utilization by the scientific community, the code implementation will be made publicly available.
arXiv Detail & Related papers (2021-08-10T23:17:16Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR in low-latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition [58.69803243323346]
Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks.
However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR.
We present the dual causal/non-causal self-attention architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer.
arXiv Detail & Related papers (2021-07-02T20:56:13Z)
- Untangling tradeoffs between recurrence and self-attention in neural networks [81.30894993852813]
We present a formal analysis of how self-attention affects gradient propagation in recurrent networks.
We prove that it mitigates the problem of vanishing gradients when trying to capture long-term dependencies.
We propose a relevancy screening mechanism that allows for a scalable use of sparse self-attention with recurrence.
arXiv Detail & Related papers (2020-06-16T19:24:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.