Paying More Attention to Self-attention: Improving Pre-trained Language
Models via Attention Guiding
- URL: http://arxiv.org/abs/2204.02922v1
- Date: Wed, 6 Apr 2022 16:22:02 GMT
- Title: Paying More Attention to Self-attention: Improving Pre-trained Language
Models via Attention Guiding
- Authors: Shanshan Wang, Zhumin Chen, Zhaochun Ren, Huasheng Liang, Qiang Yan,
Pengjie Ren
- Abstract summary: Pre-trained language models (PLMs) have demonstrated their effectiveness for a broad range of information retrieval and natural language processing tasks.
As the core component of PLMs, multi-head self-attention is appealing for its ability to jointly attend to information from different positions.
We propose two kinds of attention guiding methods, i.e., map discrimination guiding (MDG) and attention pattern decorrelation guiding (PDG).
- Score: 35.958164594419515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained language models (PLMs) have demonstrated their
effectiveness for a broad range of information retrieval and natural language
processing tasks. As the core component of PLMs, multi-head self-attention is
appealing for its ability to jointly attend to information from different
positions. However, researchers have found that PLMs often exhibit fixed
attention patterns regardless of the input (e.g., paying excessive attention
to [CLS] or [SEP]), which we argue may cause important information at other
positions to be neglected. In this work, we propose a simple yet effective
attention guiding mechanism that improves the performance of PLMs by
encouraging attention toward established goals. Specifically, we propose two
kinds of attention guiding methods, i.e., map discrimination guiding (MDG) and
attention pattern decorrelation guiding (PDG). The former explicitly
encourages diversity among multiple self-attention heads so that they jointly
attend to information from different representation subspaces, while the
latter encourages self-attention to attend to as many different positions of
the input as possible. We conduct experiments with multiple general
pre-trained models (i.e., BERT, ALBERT, and RoBERTa) and domain-specific
pre-trained models (i.e., BioBERT, ClinicalBERT, BlueBERT, and SciBERT) on
three benchmark datasets (i.e., MultiNLI, MedNLI, and Cross-genre-IR).
Extensive experimental results demonstrate that the proposed MDG and PDG bring
consistent performance improvements on all datasets with high efficiency and
low cost.
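To make the two guiding objectives concrete, below is a minimal, hypothetical PyTorch sketch of auxiliary regularizers in the spirit of MDG and PDG. The paper's exact loss formulations are not given in this summary, so the tensor shapes, the head-pair cosine penalty, and the entropy term are all illustrative assumptions.

```python
# Hypothetical sketch (NOT the paper's exact losses): auxiliary regularizers
# in the spirit of MDG (head diversity) and PDG (position decorrelation),
# computed from one layer's attention probabilities.
import torch
import torch.nn.functional as F

def mdg_like_loss(attn):
    """MDG-flavoured sketch: push different heads toward distinct maps.

    attn: (batch, heads, seq, seq) attention probabilities.
    Penalizes pairwise cosine similarity between the flattened
    attention maps of different heads.
    """
    b, h, s, _ = attn.shape
    maps = F.normalize(attn.reshape(b, h, s * s), dim=-1)  # unit-norm maps
    sim = torch.einsum("bid,bjd->bij", maps, maps)         # (b, h, h) head-pair sims
    diag = torch.diag_embed(torch.diagonal(sim, dim1=-2, dim2=-1))
    return (sim - diag).abs().mean()                       # lower = more diverse heads

def pdg_like_loss(attn, eps=1e-9):
    """PDG-flavoured sketch: spread attention across input positions.

    Maximizes the entropy of the attention mass received per position,
    so that no single token (e.g., [CLS] or [SEP]) soaks up all attention.
    """
    received = attn.mean(dim=(1, 2))                       # (b, seq): mass per key position
    received = received / (received.sum(-1, keepdim=True) + eps)
    entropy = -(received * (received + eps).log()).sum(-1)
    return -entropy.mean()                                 # minimize = maximize entropy
```

Under this sketch, the guiding terms would be added to the downstream task loss with small weights, e.g. `loss = task_loss + lam1 * mdg_like_loss(attn) + lam2 * pdg_like_loss(attn)`, where `lam1` and `lam2` are assumed hyperparameters.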
Related papers
- Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration [15.36841874118801]
We aim to provide a more profound understanding of the existence of attention sinks within large language models (LLMs).
We propose a training-free Attention Calibration Technique (ACT) that automatically optimizes attention distributions on the fly during inference in an input-adaptive manner.
ACT achieves an average improvement of up to 7.30% in accuracy across different datasets when applied to Llama-30B.
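As a toy illustration only (this is not the paper's ACT procedure, whose details are not given in this summary), input-adaptive attention calibration at inference could look roughly like the following: detect positions that receive a disproportionate share of attention mass and damp them.

```python
# Toy illustration of inference-time attention calibration (hypothetical;
# not the paper's ACT algorithm). Damps "sink" positions that receive a
# disproportionate share of attention, then renormalizes each row.
import torch

def calibrate_attention(attn, sink_factor=3.0, damping=0.5):
    """attn: (heads, seq, seq) attention probabilities for one input."""
    received = attn.mean(dim=(0, 1))                  # (seq,): mass per key position
    sinks = received > sink_factor * received.mean()  # input-adaptive sink detection
    scale = torch.where(sinks,
                        torch.full_like(received, damping),
                        torch.ones_like(received))
    calibrated = attn * scale                         # scale down sink columns
    return calibrated / calibrated.sum(dim=-1, keepdim=True)
```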
arXiv Detail & Related papers (2024-06-22T07:00:43Z) - MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances in self-supervised learning (SSL) for pre-training strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z) - Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use [74.72150542395487]
An inherent waveform pattern in the attention allocation of large language models (LLMs) significantly affects their performance in tasks demanding a high degree of context awareness.
To address this issue, we propose a novel inference method named Attention Buckets.
arXiv Detail & Related papers (2023-12-07T17:24:51Z) - Mitigating Shortcut Learning with Diffusion Counterfactuals and Diverse Ensembles [95.49699178874683]
We propose DiffDiv, an ensemble diversification framework exploiting Diffusion Probabilistic Models (DPMs).
We show that DPMs can generate images with novel feature combinations, even when trained on samples displaying correlated input features.
We show that DPM-guided diversification is sufficient to remove dependence on shortcut cues, without a need for additional supervised signals.
arXiv Detail & Related papers (2023-11-23T15:47:33Z) - Leveraging Diffusion Disentangled Representations to Mitigate Shortcuts
in Underspecified Visual Tasks [92.32670915472099]
We propose an ensemble diversification framework exploiting the generation of synthetic counterfactuals using Diffusion Probabilistic Models (DPMs).
We show that diffusion-guided diversification can lead models to avert attention from shortcut cues, achieving ensemble diversity performance comparable to previous methods requiring additional data collection.
arXiv Detail & Related papers (2023-10-03T17:37:52Z) - Beyond Just Vision: A Review on Self-Supervised Representation Learning
on Multimodal and Temporal Data [10.006890915441987]
The popularity of self-supervised learning is driven by the fact that traditional models typically require large amounts of well-annotated data for training.
Self-supervised methods have been introduced to improve training-data efficiency through discriminative pre-training of models.
We aim to provide the first comprehensive review of multimodal self-supervised learning methods for temporal data.
arXiv Detail & Related papers (2022-06-06T04:59:44Z) - Dual Cross-Attention Learning for Fine-Grained Visual Categorization and
Object Re-Identification [19.957957963417414]
We propose a dual cross-attention learning (DCAL) algorithm to coordinate with self-attention learning.
First, we propose global-local cross-attention (GLCA) to enhance the interactions between global images and local high-response regions.
Second, we propose pair-wise cross-attention (PWCA) to establish the interactions between image pairs.
arXiv Detail & Related papers (2022-05-04T16:14:26Z) - Collaborative Attention Mechanism for Multi-View Action Recognition [75.33062629093054]
We propose a collaborative attention mechanism (CAM) for solving the multi-view action recognition problem.
The proposed CAM detects attention differences among multiple views and adaptively integrates frame-level information so that the views benefit each other.
Experiments on four action datasets show that the proposed CAM achieves better results for each view and also boosts multi-view performance.
arXiv Detail & Related papers (2020-09-14T17:33:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.