Understanding Self-Attention of Self-Supervised Audio Transformers
- URL: http://arxiv.org/abs/2006.03265v2
- Date: Mon, 10 Aug 2020 18:48:41 GMT
- Title: Understanding Self-Attention of Self-Supervised Audio Transformers
- Authors: Shu-wen Yang, Andy T. Liu, Hung-yi Lee
- Abstract summary: Self-supervised Audio Transformers (SAT) have enabled great success in many downstream speech applications such as ASR, but how they work has not yet been widely explored.
In this work, we present multiple strategies for the analysis of attention mechanisms in SAT.
- Score: 74.38550595045855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised Audio Transformers (SAT) have enabled great success in many
downstream speech applications such as ASR, but how they work has not yet been
widely explored. In this work, we present multiple strategies for analyzing the
attention mechanisms in SAT. We categorize attention patterns into explainable
categories and find that each category possesses its own unique functionality.
We provide a visualization tool for understanding multi-head self-attention,
importance ranking strategies for identifying critical attention heads, and
attention refinement techniques to improve model performance.
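As a rough illustration of the analyses described above, the sketch below categorizes per-head attention maps into "diagonal", "vertical", and "global" patterns and ranks heads by an entropy-based importance score. The category heuristics, thresholds, and the entropy criterion are illustrative assumptions, not the paper's exact definitions.
```python
import numpy as np

def categorize_head(attn, diag_band=2, diag_thresh=0.5, vert_thresh=0.5):
    """Heuristically label one head's attention map (T x T, rows sum to 1).

    Band width and thresholds are illustrative assumptions, not the
    paper's exact criteria.
    """
    T = attn.shape[0]
    idx = np.arange(T)
    # Mass on a narrow band around the diagonal -> "diagonal" (local) head.
    band = np.abs(idx[:, None] - idx[None, :]) <= diag_band
    diag_mass = (attn * band).sum() / T
    # Mass concentrated on a few columns -> "vertical" head.
    col_mass = attn.mean(axis=0)
    vert_mass = np.sort(col_mass)[-3:].sum()
    if diag_mass >= diag_thresh:
        return "diagonal"
    if vert_mass >= vert_thresh:
        return "vertical"
    return "global"

def head_importance(attn_maps):
    """Rank heads by attention entropy: lower entropy (more focused) is
    assumed to be more important. attn_maps: (num_heads, T, T)."""
    eps = 1e-9
    entropy = -(attn_maps * np.log(attn_maps + eps)).sum(axis=-1).mean(axis=-1)
    return np.argsort(entropy)  # most focused heads first

# Toy usage with random attention maps (softmax over random logits).
rng = np.random.default_rng(0)
logits = rng.normal(size=(12, 50, 50))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print([categorize_head(a) for a in attn[:4]])
print("head ranking:", head_importance(attn))
```
Low-entropy heads are ranked first here purely as a plausible proxy for "critical attention"; the paper's own importance ranking strategies may use different criteria.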
Related papers
- AttentionViz: A Global View of Transformer Attention [60.82904477362676]
We present a new visualization technique designed to help researchers understand the self-attention mechanism in transformers.
The main idea behind our method is to visualize a joint embedding of the query and key vectors used by transformer models to compute attention.
We create an interactive visualization tool, AttentionViz, based on these joint query-key embeddings.
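A minimal sketch of the joint query-key embedding idea described above, assuming per-head query and key matrices have already been extracted from a model: queries and keys are stacked into one matrix and projected to a shared 2-D plane with a simple PCA so both can be plotted together. The PCA projection and the toy data are assumptions for illustration; AttentionViz itself is an interactive tool.
```python
import numpy as np

def joint_query_key_2d(Q, K):
    """Project queries and keys of one head into a shared 2-D plane.

    Q, K: (T, d) arrays for one attention head. Returns (q2d, k2d).
    PCA via SVD is used here as an illustrative projection.
    """
    X = np.concatenate([Q, K], axis=0)          # stack queries and keys
    X = X - X.mean(axis=0, keepdims=True)       # center jointly
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    X2d = X @ Vt[:2].T                          # top-2 principal directions
    T = Q.shape[0]
    return X2d[:T], X2d[T:]

# Toy usage: random query/key vectors for a 64-dim head over 100 frames.
rng = np.random.default_rng(1)
q2d, k2d = joint_query_key_2d(rng.normal(size=(100, 64)),
                              rng.normal(size=(100, 64)))
# q2d and k2d can now be scatter-plotted together to inspect how queries
# cluster relative to the keys they attend to.
```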
arXiv Detail & Related papers (2023-05-04T23:46:49Z)
- Top-Down Visual Attention from Analysis by Synthesis [87.47527557366593]
We consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision.
We propose Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention.
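A generic illustration of top-down modulation (not the AbSViT formulation itself): a hypothetical task vector re-weights the input tokens before an ordinary attention step, so the same input yields different attention depending on the top-down signal.
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def top_down_attention(tokens, task_vec, Wq, Wk, Wv):
    """Toy top-down attention: a task vector gates the tokens before
    ordinary scaled dot-product attention. Illustrative only; AbSViT
    derives its top-down signal from an analysis-by-synthesis objective."""
    gate = softmax(tokens @ task_vec)            # (T,) relevance of each token
    x = tokens * gate[:, None]                   # top-down modulated tokens
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return attn @ V

rng = np.random.default_rng(2)
d = 32
out = top_down_attention(rng.normal(size=(10, d)), rng.normal(size=d),
                         *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (10, 32)
```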
arXiv Detail & Related papers (2023-03-23T05:17:05Z)
- Pay Self-Attention to Audio-Visual Navigation [24.18976027602831]
We propose an end-to-end framework to learn chasing after a moving audio target using a context-aware audio-visual fusion strategy.
Our thorough experiments validate the superior performance of the proposed FSAAVN framework in comparison with state-of-the-art methods.
arXiv Detail & Related papers (2022-10-04T03:42:36Z)
- Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition [32.45255303465946]
We introduce sparse attention and monotonic attention into Transformer-based ASR.
The experiments show that our method can effectively improve the attention mechanism on widely used benchmarks of speech recognition.
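A minimal sketch of one common way to sparsify attention, keeping only the top-k logits per query before the softmax; the sparse and monotonic formulations used in the paper may differ, so treat this as an illustration of the general idea rather than the paper's method.
```python
import numpy as np

def topk_sparse_attention(Q, K, V, k=8):
    """Scaled dot-product attention where each query attends only to its
    top-k keys; all other logits are masked out. An illustrative form of
    sparse attention, not necessarily the paper's exact variant."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])              # (Tq, Tk)
    kth = np.sort(scores, axis=-1)[:, -k][:, None]       # k-th largest per row
    scores = np.where(scores >= kth, scores, -np.inf)    # keep only top-k
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(3)
Q, K, V = (rng.normal(size=(20, 64)) for _ in range(3))
print(topk_sparse_attention(Q, K, V, k=4).shape)  # (20, 64)
```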
arXiv Detail & Related papers (2022-09-30T01:55:57Z)
- Improving Speech Emotion Recognition Through Focus and Calibration Attention Mechanisms [0.5994412766684842]
We identify misalignments between the attention and the signal amplitude in the existing multi-head self-attention.
We propose to use a Focus-Attention (FA) mechanism and a novel Calibration-Attention (CA) mechanism in combination with the multi-head self-attention.
By employing the CA mechanism, the network can modulate the information flow by assigning different weights to each attention head and improve the utilization of surrounding contexts.
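A minimal sketch of the head-reweighting idea described above: a learned scalar gate per attention head scales that head's output before the heads are recombined. This is an illustrative reading of the Calibration-Attention mechanism, not the paper's exact module.
```python
import torch
import torch.nn as nn

class HeadGate(nn.Module):
    """Learned per-head gates that modulate the contribution of each
    attention head. Illustrative sketch, not the paper's exact CA module."""
    def __init__(self, num_heads):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_heads))

    def forward(self, head_outputs):
        # head_outputs: (batch, num_heads, seq_len, head_dim)
        gates = torch.softmax(self.logits, dim=0)   # head weights sum to 1
        return head_outputs * gates.view(1, -1, 1, 1)

# Toy usage: reweight the outputs of 8 heads before concatenation.
gate = HeadGate(num_heads=8)
x = torch.randn(2, 8, 100, 64)
y = gate(x)                                      # same shape, per-head scaled
print(y.shape)
```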
arXiv Detail & Related papers (2022-08-21T08:04:22Z)
- Alignment Attention by Matching Key and Query Distributions [48.93793773929006]
This paper introduces alignment attention that explicitly encourages self-attention to match the distributions of the key and query within each head.
It is simple to convert any models with self-attention, including pre-trained ones, to the proposed alignment attention.
On a variety of language understanding tasks, we show the effectiveness of our method in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
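A minimal sketch of one way to encourage the query and key distributions of a head to align, using a simple first- and second-moment matching penalty added to the training loss. The paper's actual matching objective may differ; this is only an illustrative proxy for distribution alignment.
```python
import torch

def query_key_alignment_loss(Q, K):
    """Penalize mismatch between the empirical mean and (per-dimension)
    variance of the query and key vectors of one head. A simple
    moment-matching proxy, used here purely for illustration.

    Q, K: (num_tokens, head_dim) tensors.
    """
    mean_gap = (Q.mean(dim=0) - K.mean(dim=0)).pow(2).mean()
    var_gap = (Q.var(dim=0) - K.var(dim=0)).pow(2).mean()
    return mean_gap + var_gap

# Toy usage: add the penalty to a task loss with a small weight.
Q = torch.randn(100, 64, requires_grad=True)
K = torch.randn(100, 64, requires_grad=True)
loss = query_key_alignment_loss(Q, K)
loss.backward()
print(float(loss))
```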
arXiv Detail & Related papers (2021-10-25T00:54:57Z)
- Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers [55.40032342541187]
We pre-train a transformer-based model with input-independent attention weights in a self-supervised fashion and treat it as a feature extractor on downstream tasks.
Our approach shows performance comparable to typical self-attention while requiring 20% less time in both training and inference.
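A minimal sketch of input-independent attention: instead of computing softmax(QK^T) from the input, each head uses a learned attention table over positions, so the attention weights are fixed at inference time. The parameterization and sizes below are assumptions for illustration, not the paper's exact model.
```python
import torch
import torch.nn as nn

class InputIndependentAttention(nn.Module):
    """Each head owns a learned (max_len x max_len) attention table over
    positions; only the value projection depends on the input.
    Illustrative sketch of the input-independent idea."""
    def __init__(self, dim, num_heads, max_len):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.attn_logits = nn.Parameter(torch.zeros(num_heads, max_len, max_len))
        self.value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        B, T, D = x.shape
        v = self.value(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        attn = torch.softmax(self.attn_logits[:, :T, :T], dim=-1)  # (H, T, T)
        y = torch.einsum('hts,bhsd->bhtd', attn, v)
        return self.out(y.transpose(1, 2).reshape(B, T, D))

# Toy usage on a batch of 160-frame, 768-dim acoustic features.
layer = InputIndependentAttention(dim=768, num_heads=12, max_len=512)
print(layer(torch.randn(2, 160, 768)).shape)   # torch.Size([2, 160, 768])
```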
arXiv Detail & Related papers (2020-06-09T10:40:52Z)
- Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition? [63.564385139097624]
This work investigates visual self-supervision via face reconstruction to guide the learning of audio representations.
We show that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features.
We evaluate our learned audio representations for discrete emotion recognition, continuous affect recognition and automatic speech recognition.
arXiv Detail & Related papers (2020-05-04T11:33:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.