Guiding Attention for Self-Supervised Learning with Transformers
- URL: http://arxiv.org/abs/2010.02399v1
- Date: Tue, 6 Oct 2020 00:04:08 GMT
- Title: Guiding Attention for Self-Supervised Learning with Transformers
- Authors: Ameet Deshpande, Karthik Narasimhan
- Abstract summary: We propose a technique to allow for efficient self-supervised learning with bi-directional Transformers.
Our approach is motivated by recent studies demonstrating that self-attention patterns in trained models contain a majority of non-linguistic regularities.
- Score: 24.785500242464646
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a simple and effective technique to allow for
efficient self-supervised learning with bi-directional Transformers. Our
approach is motivated by recent studies demonstrating that self-attention
patterns in trained models contain a majority of non-linguistic regularities.
We propose a computationally efficient auxiliary loss function to guide
attention heads to conform to such patterns. Our method is agnostic to the
actual pre-training objective and results in faster convergence of models as
well as better performance on downstream tasks compared to the baselines,
achieving state of the art results in low-resource settings. Surprisingly, we
also find that linguistic properties of attention heads are not necessarily
correlated with language modeling performance.
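The abstract describes the method only at a high level, so the snippet below gives a minimal PyTorch-style sketch of how an auxiliary attention-guidance loss of this kind can be combined with a standard pre-training objective. The prototype pattern set (current/previous/next token), the round-robin assignment of heads to patterns, the MSE form of the penalty, and the `lambda_guide` weight are all illustrative assumptions, not details taken from the paper.
```python
import torch
import torch.nn.functional as F

def prototype_patterns(seq_len: int, device=None) -> torch.Tensor:
    """Simple target attention maps (an illustrative set, not necessarily the
    paper's): attend to the current token, the previous token, or the next token."""
    current = torch.eye(seq_len, device=device)
    previous = torch.diag(torch.ones(seq_len - 1, device=device), -1)
    nxt = torch.diag(torch.ones(seq_len - 1, device=device), 1)
    patterns = torch.stack([current, previous, nxt])       # (P, T, T)
    patterns = patterns + 1e-8                              # avoid all-zero rows
    return patterns / patterns.sum(dim=-1, keepdim=True)    # row-normalize

def attention_guidance_loss(attn_probs: torch.Tensor, patterns: torch.Tensor) -> torch.Tensor:
    """MSE between each head's attention map (batch, heads, T, T) and an
    assigned prototype pattern; round-robin head assignment is an assumption."""
    n_heads = attn_probs.size(1)
    idx = torch.arange(n_heads, device=patterns.device) % patterns.size(0)
    assigned = patterns[idx]                                 # (H, T, T)
    return F.mse_loss(attn_probs, assigned.unsqueeze(0).expand_as(attn_probs))

# The guidance term is simply added to the usual pre-training loss:
#   total_loss = mlm_loss + lambda_guide * attention_guidance_loss(attn_probs, patterns)
# where `lambda_guide` is a hypothetical weighting hyperparameter.
```
Because the guidance term only touches the attention probabilities, it is independent of the particular pre-training objective (masked LM here is just one example), which is consistent with the abstract's claim that the method is agnostic to the pre-training objective.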
Related papers
- Efficient Test-Time Prompt Tuning for Vision-Language Models [41.90997623029582]
Self-TPT is a framework leveraging Self-supervised learning for efficient Test-time Prompt Tuning.
We show that Self-TPT not only significantly reduces inference costs but also achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-08-11T13:55:58Z)
- DETAIL: Task DEmonsTration Attribution for Interpretable In-context Learning [75.68193159293425]
In-context learning (ICL) allows transformer-based language models to learn a specific task with a few "task demonstrations" without updating their parameters.
We propose an influence function-based attribution technique, DETAIL, that addresses the specific characteristics of ICL.
We experimentally prove the wide applicability of DETAIL by showing that attribution scores obtained on white-box models are transferable to black-box models and improve model performance.
arXiv Detail & Related papers (2024-05-22T15:52:52Z)
- Entailment as Robust Self-Learner [14.86757876218415]
We design a prompting strategy that formulates a number of different NLU tasks as contextual entailment.
We propose the Simple Pseudo-Label Editing (SimPLE) algorithm for better pseudo-labeling quality in self-training.
arXiv Detail & Related papers (2023-05-26T18:41:23Z)
- Towards Compute-Optimal Transfer Learning [82.88829463290041]
We argue that zero-shot structured pruning of pretrained models allows them to increase compute efficiency with minimal reduction in performance.
Our results show that pruning convolutional filters of pretrained models can lead to more than 20% performance improvement in low computational regimes.
arXiv Detail & Related papers (2023-04-25T21:49:09Z)
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
- Learning Rich Nearest Neighbor Representations from Self-supervised Ensembles [60.97922557957857]
We provide a framework to perform self-supervised model ensembling via a novel method of learning representations directly through gradient descent at inference time.
This technique improves representation quality, as measured by k-nearest neighbors, both on the in-domain dataset and in the transfer setting.
arXiv Detail & Related papers (2021-10-19T22:24:57Z)
- Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training [66.80558875393565]
We study the problem of training named entity recognition (NER) models using only distantly-labeled data.
We propose a noise-robust learning scheme comprised of a new loss function and a noisy label removal step.
Our method achieves superior performance, outperforming existing distantly-supervised NER models by significant margins.
arXiv Detail & Related papers (2021-09-10T17:19:56Z)
- Layer-wise Analysis of a Self-supervised Speech Representation Model [26.727775920272205]
Self-supervised learning approaches have been successful for pre-training speech representation models.
However, little has been studied about the type or extent of information encoded in the pre-trained representations themselves.
arXiv Detail & Related papers (2021-07-10T02:13:25Z)
- On Learning Text Style Transfer with Direct Rewards [101.97136885111037]
The lack of parallel corpora makes it impossible to directly train supervised models for the text style transfer task.
We leverage semantic similarity metrics originally used for fine-tuning neural machine translation models.
Our model provides significant gains in both automatic and human evaluation over strong baselines.
arXiv Detail & Related papers (2020-10-24T04:30:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.