Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers
- URL: http://arxiv.org/abs/2006.05174v2
- Date: Tue, 3 Nov 2020 06:32:17 GMT
- Title: Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers
- Authors: Tsung-Han Wu, Chun-Chen Hsieh, Yen-Hao Chen, Po-Han Chi, Hung-yi Lee
- Abstract summary: We pre-train a transformer-based model with various attention algorithms in a self-supervised fashion and treat the resulting models as feature extractors on downstream tasks.
Our approach shows comparable performance to the typical self-attention yet requires 20% less time in both training and inference.
- Score: 55.40032342541187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we seek solutions for reducing the computational complexity of
transformer-based models for speech representation learning. We evaluate 10
attention algorithms; then, we pre-train the transformer-based model with those
attention algorithms in a self-supervised fashion and treat them as feature
extractors on downstream tasks, including phoneme classification and speaker
classification. With the assistance of t-SNE, PCA, and some observations, the
attention weights in self-supervised audio transformers can be categorized into
four general cases. Based on these cases and some analyses, we are able to use
a specific set of attention weights to initialize the model. Our approach shows
comparable performance to the typical self-attention yet requires 20% less time
in both training and inference.
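To make the core idea concrete, below is a minimal sketch (in PyTorch) of a self-attention layer whose attention map is an input-independent, learnable parameter rather than softmax(QK^T / sqrt(d)). This is not the authors' implementation; the module name, the single-head simplification, and the shapes are illustrative assumptions.
```python
# Minimal sketch (not the authors' code): a self-attention layer whose
# attention map is an input-independent, learnable parameter instead of
# softmax(QK^T / sqrt(d)). Names and shapes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InputIndependentAttention(nn.Module):
    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        # One attention logit per (query position, key position) pair,
        # shared across all inputs in place of Q/K projections.
        self.attn_logits = nn.Parameter(torch.zeros(max_len, max_len))
        self.value = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        weights = F.softmax(self.attn_logits[:seq_len, :seq_len], dim=-1)
        v = self.value(x)  # (batch, seq_len, d_model)
        # Mix value vectors with the same fixed weights for every input.
        context = torch.einsum("qk,bkd->bqd", weights, v)
        return self.out(context)

# Usage sketch:
# layer = InputIndependentAttention(d_model=768, max_len=512)
# out = layer(torch.randn(2, 100, 768))
```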
Related papers
- On-Chip Learning via Transformer In-Context Learning [0.9353041869660692]
The self-attention mechanism requires transferring prior token projections from main memory at each time step.
We present a neuromorphic decoder-only transformer model that utilizes an on-chip plasticity processor to compute self-attention.
arXiv Detail & Related papers (2024-10-11T10:54:09Z)
- Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization (a rough sketch of position-based fixed attention appears after this list).
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
- Visual Transformers for Primates Classification and Covid Detection [8.747840760772268]
We apply the vision transformer, a deep machine learning model built around the attention mechanism, to mel-spectrogram representations of raw audio recordings.
By adding mel-based data augmentation techniques and sample weighting, we achieve comparable performance on both ComParE21 tasks (the PRS and CCS challenges).
arXiv Detail & Related papers (2022-12-20T09:10:25Z)
- How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones.
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
arXiv Detail & Related papers (2022-11-07T12:37:54Z)
- Deep Clustering For General-Purpose Audio Representations [2.8086459907382224]
We introduce DECAR, a self-supervised pre-training approach for learning general-purpose audio representations.
We pre-train DECAR embeddings on a balanced subset of the large-scale Audioset dataset.
We transfer those representations to 9 downstream classification tasks, including speech, music, animal sounds, and acoustic scenes.
arXiv Detail & Related papers (2021-10-17T19:03:51Z)
- Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers [78.26411729589526]
We propose the first method to explain predictions made by any Transformer-based architecture.
Our method is superior to all existing methods, which are adapted from single-modality explainability.
arXiv Detail & Related papers (2021-03-29T15:03:11Z)
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Visualizing the attention maps of a pre-trained model is one direct way to understand the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
- Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
arXiv Detail & Related papers (2020-09-30T13:18:57Z)
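As a companion to the Localized Gaussians entry above, here is a rough, hypothetical sketch of attention weights computed purely from relative token positions via a Gaussian profile, i.e., fixed weights that never depend on the input content. It is not the cited paper's point-cloud method; the function name and the default sigma are assumptions.
```python
# Rough sketch of position-based, input-independent attention weights:
# each query position attends to key positions with a Gaussian profile
# centered on itself, so the weights ignore the input content entirely.
import torch

def gaussian_attention_weights(seq_len: int, sigma: float = 2.0) -> torch.Tensor:
    positions = torch.arange(seq_len, dtype=torch.float32)
    # (seq_len, seq_len) matrix of squared distances between positions
    dist2 = (positions[:, None] - positions[None, :]) ** 2
    logits = -dist2 / (2.0 * sigma ** 2)
    return torch.softmax(logits, dim=-1)

# weights = gaussian_attention_weights(8)  # each row sums to 1
```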
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information above and is not responsible for any consequences of its use.