On the Importance of Local Information in Transformer Based Models
- URL: http://arxiv.org/abs/2008.05828v1
- Date: Thu, 13 Aug 2020 11:32:47 GMT
- Title: On the Importance of Local Information in Transformer Based Models
- Authors: Madhura Pande, Aakriti Budhraja, Preksha Nema, Pratyush Kumar, Mitesh M. Khapra
- Abstract summary: The self-attention module is a key component of Transformer-based models.
Recent studies have shown that these heads exhibit syntactic, semantic, or local behaviour.
We show that a larger fraction of heads have a locality bias as compared to a syntactic bias.
- Score: 19.036044858449593
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The self-attention module is a key component of Transformer-based models,
wherein each token pays attention to every other token. Recent studies have
shown that these heads exhibit syntactic, semantic, or local behaviour. Some
studies have also identified promise in restricting this attention to be local,
i.e., a token attending to other tokens only in a small neighbourhood around
it. However, no conclusive evidence exists that such local attention alone is
sufficient to achieve high accuracy on multiple NLP tasks. In this work, we
systematically analyse the role of locality information in learnt models and
contrast it with the role of syntactic information. More specifically, we first
do a sensitivity analysis and show that, at every layer, the representation of
a token is much more sensitive to tokens in a small neighbourhood around it than
to tokens which are syntactically related to it. We then define an attention
bias metric to determine whether a head pays more attention to local tokens or
to syntactically related tokens. We show that a larger fraction of heads have a
locality bias as compared to a syntactic bias. Having established the
importance of local attention heads, we train and evaluate models where varying
fractions of the attention heads are constrained to be local. Such models would
be more efficient as they would have fewer computations in the attention layer.
We evaluate these models on 4 GLUE datasets (QQP, SST-2, MRPC, QNLI) and 2 MT
datasets (En-De, En-Ru) and clearly demonstrate that such constrained models
have comparable performance to the unconstrained models. Through this
systematic evaluation we establish that attention in Transformer-based models
can be constrained to be local without affecting performance.
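The constrained models described above restrict a head's attention to a small window around each token. A minimal NumPy sketch of such a windowed (local) attention mask is shown below; the function name, window size, and banded-mask formulation are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np

def local_attention(scores: np.ndarray, window: int) -> np.ndarray:
    """Mask raw attention scores so each token attends only to tokens
    within +/- `window` positions, then normalise rows with softmax."""
    n = scores.shape[-1]
    idx = np.arange(n)
    # band mask: True where |i - j| <= window
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    masked = np.where(mask, scores, -np.inf)
    # numerically stable row-wise softmax
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights

# toy example: 5 tokens, random scores, window of 1
rng = np.random.default_rng(0)
w = local_attention(rng.standard_normal((5, 5)), window=1)
print(np.allclose(w.sum(axis=-1), 1.0))  # True: rows are distributions
print(w[0, 3])                           # 0.0: out-of-window weight
```

Because each row has at most `2 * window + 1` non-zero entries, a sparse implementation of this mask reduces the attention computation from quadratic to linear in sequence length, which is the efficiency gain the abstract alludes to.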
Related papers
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
- FedDistill: Global Model Distillation for Local Model De-Biasing in Non-IID Federated Learning [10.641875933652647]
Federated Learning (FL) is a novel approach that allows for collaborative machine learning.
FL faces challenges due to non-uniformly distributed (non-iid) data across clients.
This paper introduces FedDistill, a framework enhancing the knowledge transfer from the global model to local models.
arXiv Detail & Related papers (2024-04-14T10:23:30Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Federated Learning of Models Pre-Trained on Different Features with Consensus Graphs [19.130197923214123]
Learning an effective global model on private and decentralized datasets has become an increasingly important challenge of machine learning.
We propose a feature fusion approach that extracts local representations from local models and incorporates them into a global representation that improves the prediction performance.
This paper presents solutions to these problems and demonstrates them in real-world applications on time series data such as power grids and traffic networks.
arXiv Detail & Related papers (2023-06-02T02:24:27Z)
- Adaptive Local-Component-aware Graph Convolutional Network for One-shot Skeleton-based Action Recognition [54.23513799338309]
We present an Adaptive Local-Component-aware Graph Convolutional Network for skeleton-based action recognition.
Our method provides a stronger representation than the global embedding and helps our model reach state-of-the-art.
arXiv Detail & Related papers (2022-09-21T02:33:07Z)
- S2RL: Do We Really Need to Perceive All States in Deep Multi-Agent Reinforcement Learning? [26.265100805551764]
Collaborative multi-agent reinforcement learning (MARL) has been widely used in many practical applications.
We propose a sparse state based MARL framework, which utilizes a sparse attention mechanism to discard irrelevant information in local observations.
arXiv Detail & Related papers (2022-06-20T07:33:40Z)
- Contrastive Neighborhood Alignment [81.65103777329874]
We present Contrastive Neighborhood Alignment (CNA), a manifold learning approach to maintain the topology of learned features.
The target model aims to mimic the local structure of the source representation space using a contrastive loss.
CNA is illustrated in three scenarios: manifold learning, where the model maintains the local topology of the original data in a dimension-reduced space; model distillation, where a small student model is trained to mimic a larger teacher; and legacy model update, where an older model is replaced by a more powerful one.
arXiv Detail & Related papers (2022-01-06T04:58:31Z)
- Improve the Interpretability of Attention: A Fast, Accurate, and Interpretable High-Resolution Attention Model [6.906621279967867]
We propose a novel Bilinear Representative Non-Parametric Attention (BR-NPA) strategy that captures the task-relevant human-interpretable information.
The proposed model can be easily adapted in a wide variety of modern deep models, where classification is involved.
It is also more accurate, faster, and with a smaller memory footprint than usual neural attention modules.
arXiv Detail & Related papers (2021-06-04T15:57:37Z)
- PGL: Prior-Guided Local Self-supervised Learning for 3D Medical Image Segmentation [87.50205728818601]
We propose a PriorGuided Local (PGL) self-supervised model that learns the region-wise local consistency in the latent feature space.
Our PGL model learns the distinctive representations of local regions, and hence is able to retain structural information.
arXiv Detail & Related papers (2020-11-25T11:03:11Z)
- LOGAN: Local Group Bias Detection by Clustering [86.38331353310114]
We argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model.
We propose LOGAN, a new bias detection technique based on clustering.
Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region.
arXiv Detail & Related papers (2020-10-06T16:42:51Z)
- Attention improves concentration when learning node embeddings [1.2233362977312945]
Given nodes labelled with search query text, we want to predict links to related queries that share products.
Experiments with a range of deep neural architectures show that simple feedforward networks with an attention mechanism perform best for learning embeddings.
We propose an analytically tractable model of query generation, AttEST, that views both products and the query text as vectors embedded in a latent space.
arXiv Detail & Related papers (2020-06-11T21:21:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.