On the Usefulness of Self-Attention for Automatic Speech Recognition with Transformers
- URL: http://arxiv.org/abs/2011.04906v1
- Date: Sun, 8 Nov 2020 16:01:38 GMT
- Title: On the Usefulness of Self-Attention for Automatic Speech Recognition with Transformers
- Authors: Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals
- Abstract summary: We train models whose encoders use self-attention in the lower layers and feed-forward layers in the upper layers, on Wall Street Journal and Switchboard.
Compared to baseline Transformers, we observe no performance drop and even minor gains.
We conclude that a global view is unnecessary when training the upper encoder layers.
- Score: 40.991809705930955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-attention models such as Transformers, which can capture temporal
relationships without being limited by the distance between events, have given
competitive speech recognition results. However, we note that the range of the
learned context increases from the lower to the upper self-attention layers, whilst
acoustic events often happen within short time spans in a left-to-right order.
This leads to a question: for speech recognition, is a global view of the
entire sequence useful for the upper self-attention encoder layers in
Transformers? To investigate this, we train models whose encoders use
self-attention in the lower layers and feed-forward layers in the upper layers
on Wall Street Journal and Switchboard. Compared to baseline Transformers, we
observe no performance drop and even minor gains. We further develop a novel
metric of the diagonality of attention matrices and find that the learned
diagonality indeed increases from the lower to the upper encoder self-attention
layers. We conclude that a global view is unnecessary when training the upper
encoder layers.
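The sketch below illustrates, in PyTorch, the kind of encoder variant the abstract describes: self-attention in the lower layers and position-wise feed-forward blocks in the upper layers, plus one plausible diagonality score for an attention matrix. It is a minimal illustration, not the authors' implementation; the layer counts, model dimensions, class names, and the exact diagonality formula are assumptions introduced here.

```python
# Minimal sketch (not the authors' code) of an encoder with lower self-attention
# layers and upper feed-forward-only layers, as studied in the paper.
import torch
import torch.nn as nn


class FeedForwardBlock(nn.Module):
    """Position-wise feed-forward layer with a residual connection and LayerNorm."""

    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.net(x))


class HybridEncoder(nn.Module):
    """Lower self-attention layers followed by upper feed-forward-only layers.

    The 6/6 split and the dimensions are illustrative, not the paper's exact setup.
    """

    def __init__(self, d_model=256, n_heads=4, d_ff=1024,
                 n_lower_attn=6, n_upper_ff=6):
        super().__init__()
        self.lower = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, d_ff, batch_first=True)
            for _ in range(n_lower_attn)
        ])
        self.upper = nn.ModuleList([
            FeedForwardBlock(d_model, d_ff) for _ in range(n_upper_ff)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.lower:
            x = layer(x)
        for layer in self.upper:
            x = layer(x)
        return x


def attention_diagonality(attn: torch.Tensor) -> torch.Tensor:
    """One plausible diagonality score for a (T, T) row-stochastic attention matrix.

    Returns a value in (0, 1] that equals 1 when all attention mass sits on the
    diagonal and decreases as mass moves to distant frames. This is an assumed
    formulation for illustration, not the paper's exact metric.
    """
    T = attn.size(-1)
    pos = torch.arange(T, dtype=attn.dtype)
    # Expected |query position - key position| under the attention distribution,
    # averaged over query positions and normalised by the maximum distance T - 1.
    dist = (pos.unsqueeze(1) - pos.unsqueeze(0)).abs()
    mean_offset = (attn * dist).sum(-1).mean() / (T - 1)
    return 1.0 - mean_offset


if __name__ == "__main__":
    feats = torch.randn(2, 100, 256)               # (batch, frames, features)
    print(HybridEncoder()(feats).shape)            # torch.Size([2, 100, 256])
    print(attention_diagonality(torch.eye(50)))    # tensor(1.) for a perfectly diagonal matrix
```

With this kind of score, the paper's observation would correspond to lower self-attention layers producing less diagonal attention matrices than upper ones, which is what motivates replacing the upper layers with feed-forward blocks.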
Related papers
- On-Chip Learning via Transformer In-Context Learning [0.9353041869660692]
The self-attention mechanism requires transferring prior token projections from the main memory at each time step.
We present a neuromorphic decoder-only transformer model that utilizes an on-chip plasticity processor to compute self-attention.
arXiv Detail & Related papers (2024-10-11T10:54:09Z)
- Manifold-Preserving Transformers are Effective for Short-Long Range Encoding [39.14128923434994]
Multi-head self-attention-based Transformers have shown promise in different learning tasks.
We propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between a pair of tokens.
arXiv Detail & Related papers (2023-10-22T06:58:28Z)
- Self-supervised Rewiring of Pre-trained Speech Encoders: Towards Faster Fine-tuning with Less Labels in Speech Processing [66.92823764664206]
We take a sober look into pre-trained speech encoders and rewire their representation space without requiring task-specific labels.
Our experiments on 6 speech processing tasks exhibit a significant convergence speedup during task fine-tuning as well as consistent task improvement.
arXiv Detail & Related papers (2022-10-24T08:27:09Z)
- Regularizing Transformers With Deep Probabilistic Layers [62.997667081978825]
In this work, we demonstrate how the inclusion of deep generative models within BERT can bring more versatile models.
We prove its effectiveness not only in Transformers but also in the most relevant encoder-decoder based LM, seq2seq with and without attention.
arXiv Detail & Related papers (2021-08-23T10:17:02Z)
- Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
arXiv Detail & Related papers (2020-09-30T13:18:57Z)
- When Can Self-Attention Be Replaced by Feed Forward Layers? [40.991809705930955]
We show that replacing the upper self-attention layers in the encoder with feed-forward layers leads to no performance drop, and even minor gains.
Our experiments offer insights into how self-attention layers process the speech signal.
arXiv Detail & Related papers (2020-05-28T10:35:49Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
- Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss [14.755108017449295]
We present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system.
Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently.
We present results on the LibriSpeech dataset showing that limiting the left context for self-attention makes decoding computationally tractable for streaming.
arXiv Detail & Related papers (2020-02-07T00:04:04Z)