On the Importance of Local Information in Transformer Based Models
- URL: http://arxiv.org/abs/2008.05828v1
- Date: Thu, 13 Aug 2020 11:32:47 GMT
- Title: On the Importance of Local Information in Transformer Based Models
- Authors: Madhura Pande, Aakriti Budhraja, Preksha Nema, Pratyush Kumar, Mitesh M. Khapra
- Abstract summary: The self-attention module is a key component of Transformer-based models.
Recent studies have shown that these heads exhibit syntactic, semantic, or local behaviour.
We show that a larger fraction of heads have a locality bias as compared to a syntactic bias.
- Score: 19.036044858449593
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The self-attention module is a key component of Transformer-based models,
wherein each token pays attention to every other token. Recent studies have
shown that these heads exhibit syntactic, semantic, or local behaviour. Some
studies have also identified promise in restricting this attention to be local,
i.e., a token attending to other tokens only in a small neighbourhood around
it. However, no conclusive evidence exists that such local attention alone is
sufficient to achieve high accuracy on multiple NLP tasks. In this work, we
systematically analyse the role of locality information in learnt models and
contrast it with the role of syntactic information. More specifically, we first
do a sensitivity analysis and show that, at every layer, the representation of
a token is much more sensitive to tokens in a small neighbourhood around it than
to tokens which are syntactically related to it. We then define an attention
bias metric to determine whether a head pays more attention to local tokens or
to syntactically related tokens. We show that a larger fraction of heads have a
locality bias as compared to a syntactic bias. Having established the
importance of local attention heads, we train and evaluate models where varying
fractions of the attention heads are constrained to be local. Such models would
be more efficient as they would have fewer computations in the attention layer.
We evaluate these models on 4 GLUE datasets (QQP, SST-2, MRPC, QNLI) and 2 MT
datasets (En-De, En-Ru) and clearly demonstrate that such constrained models
have comparable performance to the unconstrained models. Through this
systematic evaluation we establish that attention in Transformer-based models
can be constrained to be local without affecting performance.
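The constrained models described above restrict a head's attention to a small window around each token. A minimal NumPy sketch of such a windowed (local) attention mask is shown below; the function name, window size, and banded-mask formulation are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np

def local_attention(scores: np.ndarray, window: int) -> np.ndarray:
    """Mask raw attention scores so each token attends only to tokens
    within +/- `window` positions, then normalise rows with softmax."""
    n = scores.shape[-1]
    idx = np.arange(n)
    # band mask: True where |i - j| <= window
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    masked = np.where(mask, scores, -np.inf)
    # numerically stable row-wise softmax
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights

# toy example: 5 tokens, random scores, window of 1
rng = np.random.default_rng(0)
w = local_attention(rng.standard_normal((5, 5)), window=1)
print(np.allclose(w.sum(axis=-1), 1.0))  # True: rows are distributions
print(w[0, 3])                           # 0.0: out-of-window weight
```

Because each row has at most `2 * window + 1` non-zero entries, a sparse implementation of this mask reduces the attention computation from quadratic to linear in sequence length, which is the efficiency gain the abstract alludes to.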
Related papers
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
- FedDistill: Global Model Distillation for Local Model De-Biasing in Non-IID Federated Learning [10.641875933652647]
Federated Learning (FL) is a novel approach that allows for collaborative machine learning.
FL faces challenges due to non-uniformly distributed (non-iid) data across clients.
This paper introduces FedDistill, a framework enhancing the knowledge transfer from the global model to local models.
arXiv Detail & Related papers (2024-04-14T10:23:30Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Federated Learning of Models Pre-Trained on Different Features with Consensus Graphs [19.130197923214123]
Learning an effective global model on private and decentralized datasets has become an increasingly important challenge of machine learning.
We propose a feature fusion approach that extracts local representations from local models and incorporates them into a global representation that improves the prediction performance.
This paper presents solutions to these problems and demonstrates them in real-world applications on time series data such as power grids and traffic networks.
arXiv Detail & Related papers (2023-06-02T02:24:27Z)
- Adaptive Local-Component-aware Graph Convolutional Network for One-shot Skeleton-based Action Recognition [54.23513799338309]
We present an Adaptive Local-Component-aware Graph Convolutional Network for skeleton-based action recognition.
Our method provides a stronger representation than the global embedding and helps our model reach state-of-the-art.
arXiv Detail & Related papers (2022-09-21T02:33:07Z)
- S2RL: Do We Really Need to Perceive All States in Deep Multi-Agent Reinforcement Learning? [26.265100805551764]
Collaborative multi-agent reinforcement learning (MARL) has been widely used in many practical applications.
We propose a sparse state based MARL framework, which utilizes a sparse attention mechanism to discard irrelevant information in local observations.
arXiv Detail & Related papers (2022-06-20T07:33:40Z)
- Contrastive Neighborhood Alignment [81.65103777329874]
We present Contrastive Neighborhood Alignment (CNA), a manifold learning approach to maintain the topology of learned features.
The target model aims to mimic the local structure of the source representation space using a contrastive loss.
CNA is illustrated in three scenarios: manifold learning, where the model maintains the local topology of the original data in a dimension-reduced space; model distillation, where a small student model is trained to mimic a larger teacher; and legacy model update, where an older model is replaced by a more powerful one.
arXiv Detail & Related papers (2022-01-06T04:58:31Z)
- Improve the Interpretability of Attention: A Fast, Accurate, and Interpretable High-Resolution Attention Model [6.906621279967867]
We propose a novel Bilinear Representative Non-Parametric Attention (BR-NPA) strategy that captures the task-relevant human-interpretable information.
The proposed model can be easily adapted in a wide variety of modern deep models, where classification is involved.
It is also more accurate, faster, and with a smaller memory footprint than usual neural attention modules.
arXiv Detail & Related papers (2021-06-04T15:57:37Z)
- PGL: Prior-Guided Local Self-supervised Learning for 3D Medical Image Segmentation [87.50205728818601]
We propose a PriorGuided Local (PGL) self-supervised model that learns the region-wise local consistency in the latent feature space.
Our PGL model learns the distinctive representations of local regions, and hence is able to retain structural information.
arXiv Detail & Related papers (2020-11-25T11:03:11Z)
- LOGAN: Local Group Bias Detection by Clustering [86.38331353310114]
We argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model.
We propose LOGAN, a new bias detection technique based on clustering.
Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region.
arXiv Detail & Related papers (2020-10-06T16:42:51Z)
- Attention improves concentration when learning node embeddings [1.2233362977312945]
Given nodes labelled with search query text, we want to predict links to related queries that share products.
Experiments with a range of deep neural architectures show that simple feedforward networks with an attention mechanism perform best for learning embeddings.
We propose an analytically tractable model of query generation, AttEST, that views both products and the query text as vectors embedded in a latent space.
arXiv Detail & Related papers (2020-06-11T21:21:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.