Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers
- URL: http://arxiv.org/abs/2006.05174v2
- Date: Tue, 3 Nov 2020 06:32:17 GMT
- Title: Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers
- Authors: Tsung-Han Wu, Chun-Chen Hsieh, Yen-Hao Chen, Po-Han Chi, Hung-yi Lee
- Abstract summary: We pre-train a transformer-based model with various attention algorithms in a self-supervised fashion and treat the resulting models as feature extractors on downstream tasks.
Our approach shows comparable performance to the typical self-attention yet requires 20% less time in both training and inference.
- Score: 55.40032342541187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we seek solutions for reducing the computational complexity of
transformer-based models for speech representation learning. We evaluate 10
attention algorithms; then, we pre-train the transformer-based model with those
attention algorithms in a self-supervised fashion and treat them as feature
extractors on downstream tasks, including phoneme classification and speaker
classification. With the assistance of t-SNE, PCA, and some observations, the
attention weights in self-supervised audio transformers can be categorized into
four general cases. Based on these cases and some analyses, we are able to use
a specific set of attention weights to initialize the model. Our approach shows
comparable performance to the typical self-attention yet requires 20% less time
in both training and inference.
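To make the core idea concrete, below is a minimal sketch (in PyTorch) of a self-attention layer whose attention map is an input-independent, learnable parameter rather than softmax(QK^T / sqrt(d)). This is not the authors' implementation; the module name, the single-head simplification, and the shapes are illustrative assumptions.
```python
# Minimal sketch (not the authors' code): a self-attention layer whose
# attention map is an input-independent, learnable parameter instead of
# softmax(QK^T / sqrt(d)). Names and shapes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InputIndependentAttention(nn.Module):
    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        # One attention logit per (query position, key position) pair,
        # shared across all inputs in place of Q/K projections.
        self.attn_logits = nn.Parameter(torch.zeros(max_len, max_len))
        self.value = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        weights = F.softmax(self.attn_logits[:seq_len, :seq_len], dim=-1)
        v = self.value(x)  # (batch, seq_len, d_model)
        # Mix value vectors with the same fixed weights for every input.
        context = torch.einsum("qk,bkd->bqd", weights, v)
        return self.out(context)

# Usage sketch:
# layer = InputIndependentAttention(d_model=768, max_len=512)
# out = layer(torch.randn(2, 100, 768))
```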
Related papers
- On-Chip Learning via Transformer In-Context Learning [0.9353041869660692]
The self-attention mechanism requires transferring prior token projections from main memory at each time step.
We present a neuromorphic decoder-only transformer model that utilizes an on-chip plasticity processor to compute self-attention.
arXiv Detail & Related papers (2024-10-11T10:54:09Z)
- Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization (a rough sketch of position-based fixed attention appears after this list).
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
- Visual Transformers for Primates Classification and Covid Detection [8.747840760772268]
We apply the vision transformer, a deep machine learning model built around the attention mechanism, to mel-spectrogram representations of raw audio recordings.
By adding mel-based data augmentation techniques and sample weighting, we achieve comparable performance on both ComParE21 tasks (the PRS and CCS challenges).
arXiv Detail & Related papers (2022-12-20T09:10:25Z)
- How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones.
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
arXiv Detail & Related papers (2022-11-07T12:37:54Z)
- Deep Clustering For General-Purpose Audio Representations [2.8086459907382224]
We introduce DECAR, a self-supervised pre-training approach for learning general-purpose audio representations.
We pre-train DECAR embeddings on a balanced subset of the large-scale Audioset dataset.
We transfer those representations to 9 downstream classification tasks, including speech, music, animal sounds, and acoustic scenes.
arXiv Detail & Related papers (2021-10-17T19:03:51Z)
- Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers [78.26411729589526]
We propose the first method to explain predictions made by any Transformer-based architecture.
Our method is superior to all existing methods, which are adapted from single-modality explainability.
arXiv Detail & Related papers (2021-03-29T15:03:11Z)
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Visualizing the attention maps of a pre-trained model is one direct way to understand the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
- Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
arXiv Detail & Related papers (2020-09-30T13:18:57Z)
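As a companion to the Localized Gaussians entry above, here is a rough, hypothetical sketch of attention weights computed purely from relative token positions via a Gaussian profile, i.e., fixed weights that never depend on the input content. It is not the cited paper's point-cloud method; the function name and the default sigma are assumptions.
```python
# Rough sketch of position-based, input-independent attention weights:
# each query position attends to key positions with a Gaussian profile
# centered on itself, so the weights ignore the input content entirely.
import torch

def gaussian_attention_weights(seq_len: int, sigma: float = 2.0) -> torch.Tensor:
    positions = torch.arange(seq_len, dtype=torch.float32)
    # (seq_len, seq_len) matrix of squared distances between positions
    dist2 = (positions[:, None] - positions[None, :]) ** 2
    logits = -dist2 / (2.0 * sigma ** 2)
    return torch.softmax(logits, dim=-1)

# weights = gaussian_attention_weights(8)  # each row sums to 1
```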
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information above and is not responsible for any consequences of its use.