When Can Self-Attention Be Replaced by Feed Forward Layers?
- URL: http://arxiv.org/abs/2005.13895v1
- Date: Thu, 28 May 2020 10:35:49 GMT
- Title: When Can Self-Attention Be Replaced by Feed Forward Layers?
- Authors: Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals
- Abstract summary: We show that replacing the upper self-attention layers in the encoder with feed forward layers leads to no performance drop, and even minor gains.
Our experiments offer insights into how self-attention layers process the speech signal.
- Score: 40.991809705930955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, self-attention models such as Transformers have given competitive
results compared to recurrent neural network systems in speech recognition. The
key factor for the outstanding performance of self-attention models is their
ability to capture temporal relationships without being limited by the distance
between two related events. However, we note that the range of the learned
context progressively increases from the lower to upper self-attention layers,
whilst acoustic events often happen within short time spans in a left-to-right
order. This leads to a question: for speech recognition, is a global view of
the entire sequence still important for the upper self-attention layers in the
encoder of Transformers? To investigate this, we replace these self-attention
layers with feed forward layers. In our speech recognition experiments (Wall
Street Journal and Switchboard), we indeed observe an interesting result:
replacing the upper self-attention layers in the encoder with feed forward
layers leads to no performance drop, and even minor gains. Our experiments
offer insights into how self-attention layers process the speech signal, leading
to the conclusion that the lower self-attention layers of the encoder encode a
sufficiently wide range of inputs, hence learning further contextual
information in the upper layers is unnecessary.
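To make the described architecture concrete, the sketch below builds an encoder whose lower layers use self-attention and whose upper layers are plain position-wise feed-forward blocks. It is a minimal PyTorch illustration; the layer counts, model dimensions, and the use of nn.TransformerEncoderLayer are readability assumptions, not the authors' exact configuration.

```python
# Minimal PyTorch sketch of the encoder variant described in the abstract:
# lower layers keep self-attention, upper layers are replaced by
# position-wise feed-forward blocks. All hyperparameters are illustrative.
import torch
import torch.nn as nn


class FeedForwardBlock(nn.Module):
    """Position-wise feed-forward layer with residual connection and layer norm."""

    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(d_ff, d_model),
        )
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.norm(x + self.dropout(self.net(x)))


class HybridEncoder(nn.Module):
    """Lower self-attention layers followed by upper feed-forward-only layers."""

    def __init__(self, d_model=256, n_heads=4, d_ff=1024,
                 n_attention_layers=8, n_feedforward_layers=4):
        super().__init__()
        self.lower = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, d_ff, batch_first=True)
            for _ in range(n_attention_layers)
        ])
        self.upper = nn.ModuleList([
            FeedForwardBlock(d_model, d_ff) for _ in range(n_feedforward_layers)
        ])

    def forward(self, x):            # x: (batch, time, d_model), e.g. projected filterbank frames
        for layer in self.lower:
            x = layer(x)             # global context via self-attention
        for layer in self.upper:
            x = layer(x)             # frame-wise transformation only, no attention
        return x


# Example: encode a batch of 100-frame acoustic feature sequences.
features = torch.randn(2, 100, 256)
encoded = HybridEncoder()(features)  # shape: (2, 100, 256)
```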
Related papers
- Investigating the 'Autoencoder Behavior' in Speech Self-Supervised Models: a focus on HuBERT's Pretraining [0.7723409004662979]
Fine-tuning all layers of the learned model leads to lower performance than resetting the top layers.
We study the evolution of high-level information within the model during pretraining.
arXiv Detail & Related papers (2024-05-14T07:55:37Z)
- Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer [37.37547759817417]
Transformer architecture has shown impressive performance in multiple research domains.
We analyze its SGD training dynamics for the task of next token prediction.
We prove that self-attention acts as a discriminative scanning algorithm.
arXiv Detail & Related papers (2023-05-25T15:59:13Z)
- Surrogate Gradient Spiking Neural Networks as Encoders for Large Vocabulary Continuous Speech Recognition [91.39701446828144]
We show that spiking neural networks can be trained like standard recurrent neural networks using the surrogate gradient method (a minimal sketch follows this entry).
They have shown promising results on speech command recognition tasks.
In contrast to their recurrent non-spiking counterparts, they show robustness to exploding gradient problems without the need to use gates.
arXiv Detail & Related papers (2022-12-01T12:36:26Z)
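The surrogate-gradient entry above trains spiking encoders by replacing the non-differentiable spike threshold with a smooth derivative during backpropagation. Below is a minimal PyTorch sketch of that mechanism; the fast-sigmoid surrogate and its slope are illustrative choices, not necessarily the cited paper's configuration.

```python
# Surrogate-gradient trick for a spiking unit: the forward pass uses a hard
# threshold, the backward pass substitutes a smooth "fast sigmoid" derivative.
# Surrogate shape and slope are illustrative assumptions.
import torch


class SurrogateSpike(torch.autograd.Function):
    @staticmethod
    def forward(ctx, membrane_potential):
        ctx.save_for_backward(membrane_potential)
        return (membrane_potential > 0).float()            # non-differentiable spike

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        slope = 10.0
        surrogate = 1.0 / (1.0 + slope * v.abs()) ** 2     # fast-sigmoid derivative
        return grad_output * surrogate


spike = SurrogateSpike.apply

# Gradients now flow through the thresholding step:
v = torch.randn(8, requires_grad=True)
spike(v).sum().backward()
print(v.grad)   # finite surrogate gradients instead of zeros almost everywhere
```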
- Self-supervised Rewiring of Pre-trained Speech Encoders: Towards Faster Fine-tuning with Less Labels in Speech Processing [66.92823764664206]
We take a sober look at pre-trained speech encoders and rewire their representation space without requiring task-specific labels.
Our experiments on six speech processing tasks exhibit a significant convergence speedup during task fine-tuning as well as consistent task improvements.
arXiv Detail & Related papers (2022-10-24T08:27:09Z)
- On the Usefulness of Self-Attention for Automatic Speech Recognition with Transformers [40.991809705930955]
We train models whose encoders use self-attention in the lower layers and feed-forward layers in the upper layers on Wall Street Journal and Switchboard.
Compared to baseline Transformers, we observe no performance drop and even minor gains.
We conclude that a global view of the sequence is unnecessary for training the upper encoder layers.
arXiv Detail & Related papers (2020-11-08T16:01:38Z)
- Self-Attention Generative Adversarial Network for Speech Enhancement [37.14341228976058]
Existing generative adversarial networks (GANs) for speech enhancement solely rely on the convolution operation.
We propose a self-attention layer adapted from non-local attention, coupled with the convolutional and deconvolutional layers of a speech enhancement GAN (a sketch of such a block follows this entry).
Experiments show that introducing self-attention to SEGAN leads to consistent improvement across the objective evaluation metrics of enhancement performance.
arXiv Detail & Related papers (2020-10-18T22:59:07Z)
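The speech-enhancement entry above couples a self-attention layer adapted from non-local attention with the (de)convolutional layers of the enhancement GAN. The sketch below shows one common form of such a 1-D non-local block; the channel-reduction factor and gated residual are conventional choices and may differ from the paper's exact layer.

```python
# Sketch of a 1-D non-local self-attention block for speech feature maps.
# Channel reduction and the learnable residual gate follow common conventions,
# not necessarily the cited paper's exact design.
import torch
import torch.nn as nn


class NonLocalAttention1d(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        inner = channels // reduction
        self.query = nn.Conv1d(channels, inner, kernel_size=1)
        self.key = nn.Conv1d(channels, inner, kernel_size=1)
        self.value = nn.Conv1d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))        # start as identity mapping

    def forward(self, x):                                # x: (batch, channels, time)
        q = self.query(x).transpose(1, 2)                # (B, T, C/r)
        k = self.key(x)                                  # (B, C/r, T)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)    # (B, T, T)
        v = self.value(x)                                # (B, C, T)
        out = torch.bmm(v, attn.transpose(1, 2))         # mix values across all time steps
        return x + self.gamma * out                      # gated residual connection


# Example: apply attention to an intermediate feature map of the enhancement network.
feats = torch.randn(2, 64, 400)
print(NonLocalAttention1d(64)(feats).shape)              # torch.Size([2, 64, 400])
```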
- Multi-Head Attention: Collaborate Instead of Concatenate [85.71058762269374]
We propose a collaborative multi-head attention layer that enables heads to learn shared projections (sketched after this entry).
Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision.
arXiv Detail & Related papers (2020-06-29T20:28:52Z)
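For the collaborative multi-head attention entry above, the sketch below illustrates one way heads can share key/query projections while retaining head-specific behavior through learned mixing vectors. This parameterization is an assumption drawn from the summary, not a verbatim reimplementation.

```python
# One possible realization of "shared key/query projections": all heads share a
# single query and key projection, and each head reweights the shared dimensions
# through a learned mixing vector. Illustrative only.
import torch
import torch.nn as nn


class CollaborativeSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_shared: int):
        super().__init__()
        self.n_heads = n_heads
        self.query = nn.Linear(d_model, d_shared)                 # shared across heads
        self.key = nn.Linear(d_model, d_shared)                   # shared across heads
        self.mix = nn.Parameter(torch.randn(n_heads, d_shared))   # per-head mixing vectors
        self.value = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                         # x: (batch, time, d_model)
        b, t, _ = x.shape
        q = self.query(x)                                         # (B, T, d_shared)
        k = self.key(x)                                           # (B, T, d_shared)
        # Per-head scores: (q * mix_h) @ k^T, computed for all heads at once.
        q_h = q.unsqueeze(1) * self.mix[None, :, None, :]         # (B, H, T, d_shared)
        scores = q_h @ k.unsqueeze(1).transpose(-2, -1)           # (B, H, T, T)
        attn = torch.softmax(scores / (q.shape[-1] ** 0.5), dim=-1)
        v = self.value(x).view(b, t, self.n_heads, -1).transpose(1, 2)  # (B, H, T, d_v)
        ctx = (attn @ v).transpose(1, 2).reshape(b, t, -1)        # concatenate heads
        return self.out(ctx)


# Example usage with 4 heads sharing a 64-dimensional key/query space.
x = torch.randn(2, 50, 256)
print(CollaborativeSelfAttention(256, 4, 64)(x).shape)            # torch.Size([2, 50, 256])
```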
- Multi-task self-supervised learning for Robust Speech Recognition [75.11748484288229]
This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments.
We employ an online speech distortion module that contaminates the input signals with a variety of random disturbances (a rough sketch follows this entry).
We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks.
arXiv Detail & Related papers (2020-01-25T00:24:45Z)
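As a rough illustration of the online speech distortion module mentioned in the PASE+ entry above, the sketch below randomly contaminates input waveforms on the fly. The specific disturbances, probabilities, and parameter ranges are assumptions for illustration, not the paper's actual list.

```python
# Illustrative online speech distortion: each training waveform is randomly
# contaminated before being fed to the encoder. Distortion types and ranges
# are assumptions, not the cited paper's configuration.
import random
import torch


def distort(waveform: torch.Tensor,
            noise_prob: float = 0.5,
            clip_prob: float = 0.3,
            drop_prob: float = 0.3) -> torch.Tensor:
    """Apply a random subset of simple disturbances to a mono waveform (1-D tensor)."""
    x = waveform.clone()
    if random.random() < noise_prob:                    # additive noise at a random SNR
        snr_db = random.uniform(0.0, 20.0)
        noise = torch.randn_like(x)
        scale = x.norm() / (noise.norm() * 10 ** (snr_db / 20.0) + 1e-8)
        x = x + scale * noise
    if random.random() < clip_prob:                     # amplitude clipping
        limit = (random.uniform(0.3, 0.9) * x.abs().max()).item()
        x = x.clamp(-limit, limit)
    if random.random() < drop_prob:                     # zero out a random chunk of samples
        length = x.numel() // 10
        start = random.randint(0, x.numel() - length)
        x[start:start + length] = 0.0
    return x


# Example: contaminate a one-second 16 kHz waveform before feeding the encoder.
clean = torch.randn(16000)
noisy = distort(clean)
```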