On the Importance of Local Information in Transformer Based Models
- URL: http://arxiv.org/abs/2008.05828v1
- Date: Thu, 13 Aug 2020 11:32:47 GMT
- Title: On the Importance of Local Information in Transformer Based Models
- Authors: Madhura Pande, Aakriti Budhraja, Preksha Nema, Pratyush Kumar, Mitesh M. Khapra
- Abstract summary: The self-attention module is a key component of Transformer-based models.
Recent studies have shown that these heads exhibit syntactic, semantic, or local behaviour.
We show that a larger fraction of heads have a locality bias as compared to a syntactic bias.
- Score: 19.036044858449593
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The self-attention module is a key component of Transformer-based models,
wherein each token pays attention to every other token. Recent studies have
shown that these heads exhibit syntactic, semantic, or local behaviour. Some
studies have also identified promise in restricting this attention to be local,
i.e., a token attending to other tokens only in a small neighbourhood around
it. However, no conclusive evidence exists that such local attention alone is
sufficient to achieve high accuracy on multiple NLP tasks. In this work, we
systematically analyse the role of locality information in learnt models and
contrast it with the role of syntactic information. More specifically, we first
do a sensitivity analysis and show that, at every layer, the representation of
a token is much more sensitive to tokens in a small neighbourhood around it than
to tokens which are syntactically related to it. We then define an attention
bias metric to determine whether a head pays more attention to local tokens or
to syntactically related tokens. We show that a larger fraction of heads have a
locality bias as compared to a syntactic bias. Having established the
importance of local attention heads, we train and evaluate models where varying
fractions of the attention heads are constrained to be local. Such models would
be more efficient as they would have fewer computations in the attention layer.
We evaluate these models on 4 GLUE datasets (QQP, SST-2, MRPC, QNLI) and 2 MT
datasets (En-De, En-Ru) and clearly demonstrate that such constrained models
have comparable performance to the unconstrained models. Through this
systematic evaluation we establish that attention in Transformer-based models
can be constrained to be local without affecting performance.
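The paper's sensitivity analysis lends itself to a small numerical illustration. Below is a minimal sketch, not the authors' code: it probes sensitivity by finite differences, and the toy locally-mixing `layer` stands in for a trained Transformer layer; all names and constants are illustrative.

```python
import numpy as np

def token_sensitivity(layer_fn, x, i, j, eps=1e-3, seed=0):
    """Finite-difference proxy for sensitivity: how much does token i's
    output representation move when token j's input embedding is nudged?"""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=x.shape[1])
    direction /= np.linalg.norm(direction)          # unit perturbation
    x_pert = x.copy()
    x_pert[j] += eps * direction
    return np.linalg.norm(layer_fn(x_pert)[i] - layer_fn(x)[i]) / eps

# Toy "layer": a mixing matrix whose weights decay with token distance,
# followed by a shared linear map and nonlinearity.
n, d = 8, 16
rng = np.random.default_rng(1)
W = rng.normal(size=(d, d)) / np.sqrt(d)
dist = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
mix = np.exp(-dist.astype(float))
mix /= mix.sum(axis=1, keepdims=True)
layer = lambda x: np.tanh(mix @ x @ W)

x = rng.normal(size=(n, d))
print(token_sensitivity(layer, x, i=4, j=5))  # nearby token: large
print(token_sensitivity(layer, x, i=4, j=0))  # distant token: small
```

In the paper's setting, the same probe would be run at every layer of a trained model, comparing perturbations of tokens in a small window around position i against perturbations of i's syntactically related tokens.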
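The attention bias metric can likewise be sketched as a comparison of attention mass. The paper's exact definition may differ; the window size and dependency links here are fabricated for illustration.

```python
import numpy as np

def locality_bias(attn, syntactic_neighbors, window=1):
    """For one head, compare the average attention mass a token places on
    its +/- `window` neighbourhood (including itself) against the mass it
    places on its syntactically related tokens. Positive result: the head
    is locality-biased; negative: syntax-biased.

    attn: (n, n) row-stochastic attention matrix for a single head.
    syntactic_neighbors: one list of token indices per row.
    """
    n = attn.shape[0]
    local = np.mean([attn[i, max(0, i - window):i + window + 1].sum()
                     for i in range(n)])
    syn = np.mean([attn[i, syntactic_neighbors[i]].sum() for i in range(n)])
    return local - syn

# A sharply diagonal (local) head on a 4-token sentence.
attn = np.full((4, 4), 0.05)
np.fill_diagonal(attn, 0.85)
neighbors = [[2], [3], [0], [1]]  # hypothetical dependency links
print(locality_bias(attn, neighbors))  # positive -> locality-biased
```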
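Finally, constraining a head to be local amounts to masking the attention logits outside a fixed band before the softmax. A single-head numpy sketch, assuming a symmetric +/- window band (the paper's constrained heads may be parameterized differently):

```python
import numpy as np

def local_attention(q, k, v, window=1):
    """Scaled dot-product attention for one head in which each token may
    only attend to tokens within +/- `window` positions of itself."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                        # (n, n) logits
    idx = np.arange(n)
    banned = np.abs(idx[:, None] - idx[None, :]) > window
    scores[banned] = -np.inf                             # mask distant tokens
    scores -= scores.max(axis=1, keepdims=True)          # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 8)) for _ in range(3))
out = local_attention(q, k, v)  # each token attends to at most 3 positions
```

Since at most 2*window + 1 scores per row are ever nonzero, an efficient implementation would compute only the banded logits, which is the source of the computational savings the abstract refers to.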
Related papers
- Unified Local and Global Attention Interaction Modeling for Vision Transformers [1.9571946424055506] (2024-12-25)
We present a novel method that extends the self-attention mechanism of a vision transformer (ViT) for more accurate object detection across diverse datasets.
ViTs show strong capability for image understanding tasks such as object detection, segmentation, and classification.
We introduce two modifications to the traditional self-attention framework: a novel aggressive convolution pooling strategy for local feature mixing, and a new conceptual attention transformation.
- Core Context Aware Attention for Long Context Language Modeling [50.774702091154204] (2024-12-17)
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-range context modeling.
Our CCA-Attention significantly outperforms state-of-the-art models in terms of computational efficiency and long-context modeling ability.
- FedDistill: Global Model Distillation for Local Model De-Biasing in Non-IID Federated Learning [10.641875933652647] (2024-04-14)
Federated Learning (FL) is an approach that enables collaborative machine learning across decentralized clients.
FL faces challenges due to non-uniformly distributed (non-iid) data across clients.
This paper introduces FedDistill, a framework enhancing the knowledge transfer from the global model to local models.
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816] (2023-06-23)
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
- Federated Learning of Models Pre-Trained on Different Features with Consensus Graphs [19.130197923214123] (2023-06-02)
Learning an effective global model on private and decentralized datasets has become an increasingly important challenge in machine learning.
We propose a feature fusion approach that extracts local representations from local models and incorporates them into a global representation that improves prediction performance.
This paper presents solutions to these problems and demonstrates them in real-world applications on time series data such as power grids and traffic networks.
- Adaptive Local-Component-aware Graph Convolutional Network for One-shot Skeleton-based Action Recognition [54.23513799338309] (2022-09-21)
We present an Adaptive Local-Component-aware Graph Convolutional Network for skeleton-based action recognition.
Our method provides a stronger representation than the global embedding and helps our model reach state-of-the-art performance.
- S2RL: Do We Really Need to Perceive All States in Deep Multi-Agent Reinforcement Learning? [26.265100805551764] (2022-06-20)
Collaborative multi-agent reinforcement learning (MARL) has been widely used in many practical applications.
We propose a sparse-state-based MARL framework, which utilizes a sparse attention mechanism to discard irrelevant information in local observations.
- Contrastive Neighborhood Alignment [81.65103777329874] (2022-01-06)
We present Contrastive Neighborhood Alignment (CNA), a manifold learning approach that maintains the topology of learned features.
The target model aims to mimic the local structure of the source representation space using a contrastive loss.
CNA is illustrated in three scenarios: manifold learning, where the model maintains the local topology of the original data in a dimension-reduced space; model distillation, where a small student model is trained to mimic a larger teacher; and legacy model update, where an older model is replaced by a more powerful one.
- Improve the Interpretability of Attention: A Fast, Accurate, and Interpretable High-Resolution Attention Model [6.906621279967867] (2021-06-04)
We propose a novel Bilinear Representative Non-Parametric Attention (BR-NPA) strategy that captures task-relevant, human-interpretable information.
The proposed model can be easily adapted to a wide variety of modern deep models where classification is involved.
It is also more accurate and faster, with a smaller memory footprint, than typical neural attention modules.
- PGL: Prior-Guided Local Self-supervised Learning for 3D Medical Image Segmentation [87.50205728818601] (2020-11-25)
We propose a Prior-Guided Local (PGL) self-supervised model that learns region-wise local consistency in the latent feature space.
Our PGL model learns distinctive representations of local regions, and hence is able to retain structural information.
- LOGAN: Local Group Bias Detection by Clustering [86.38331353310114] (2020-10-06)
We argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model.
We propose LOGAN, a new bias detection technique based on clustering.
Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region.
This list is automatically generated from the titles and abstracts of the papers on this site.