Towards Understanding the Word Sensitivity of Attention Layers: A Study via Random Features
- URL: http://arxiv.org/abs/2402.02969v2
- Date: Fri, 17 May 2024 15:31:34 GMT
- Title: Towards Understanding the Word Sensitivity of Attention Layers: A Study via Random Features
- Authors: Simone Bombari, Marco Mondelli
- Abstract summary: Our work studies word sensitivity (WS) in the prototypical setting of random features.
We show that attention layers enjoy high WS, namely, there exists a vector in the space of embeddings that largely perturbs the random attention features map.
We then translate these results on word sensitivity into generalization bounds.
- Score: 19.261178173399784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the reasons behind the exceptional success of transformers requires a better analysis of why attention layers are suitable for NLP tasks. In particular, such tasks require predictive models to capture contextual meaning which often depends on one or a few words, even if the sentence is long. Our work studies this key property, dubbed word sensitivity (WS), in the prototypical setting of random features. We show that attention layers enjoy high WS, namely, there exists a vector in the space of embeddings that largely perturbs the random attention features map. The argument critically exploits the role of the softmax in the attention layer, highlighting its benefit compared to other activations (e.g., ReLU). In contrast, the WS of standard random features is of order $1/\sqrt{n}$, $n$ being the number of words in the textual sample, and thus it decays with the length of the context. We then translate these results on word sensitivity into generalization bounds: due to their low WS, random features provably cannot learn to distinguish between two sentences that differ only in a single word; in contrast, due to their high WS, random attention features have higher generalization capabilities. We validate our theoretical results with experimental evidence over the BERT-Base word embeddings of the IMDB review dataset.
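To make the quantity concrete, the following is a minimal numerical sketch of the comparison, assuming simplified stand-ins for the two feature maps (a ReLU random-features map on the flattened sample and a single random softmax-attention layer) and a random single-word perturbation. The paper's exact definitions, scalings, and its worst-case (adversarially chosen) perturbation differ, so the function names and constants below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_features(X, V):
    """Standard random features on the flattened sample: ReLU(V vec(X))."""
    return np.maximum(V @ X.reshape(-1), 0.0)

def attention_features(X, W):
    """Toy random attention features: row-wise softmax(X W X^T / sqrt(d)) X, flattened."""
    d = X.shape[1]
    scores = X @ W @ X.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)
    return (A @ X).reshape(-1)

def word_sensitivity(feat, X, delta, i=0):
    """Relative change of the feature map when only word i is perturbed by delta."""
    Xp = X.copy()
    Xp[i] = Xp[i] + delta
    base, pert = feat(X), feat(Xp)
    return np.linalg.norm(pert - base) / np.linalg.norm(base)

d, k = 32, 512                                        # embedding and feature dimensions
for n in (16, 64, 256):                               # number of words in the sample
    X = rng.normal(size=(n, d)) / np.sqrt(d)          # n word embeddings of norm ~1
    V = rng.normal(size=(k, n * d)) / np.sqrt(n * d)
    W = rng.normal(size=(d, d))
    delta = rng.normal(size=d)
    delta *= np.linalg.norm(X[0]) / np.linalg.norm(delta)   # perturbation of word-sized norm

    ws_rf = word_sensitivity(lambda Z: relu_features(Z, V), X, delta)
    ws_attn = word_sensitivity(lambda Z: attention_features(Z, W), X, delta)
    # The ReLU random-features sensitivity decays roughly like 1/sqrt(n); the attention
    # number here is only the same quantity under a *random* delta -- the paper's
    # high-WS claim concerns a worst-case perturbation, which this sketch does not search for.
    print(f"n={n:4d}  1/sqrt(n)={1/np.sqrt(n):.3f}  WS(ReLU RF)={ws_rf:.3f}  WS(attn RF)={ws_attn:.3f}")
```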
Related papers
- Probing Context Localization of Polysemous Words in Pre-trained Language Model Sub-Layers [12.610445666406898]
We investigate the degree of contextualization encoded in the fine-grained sub-layer representations of a Pre-trained Language Model (PLM).
To identify the main contributions of sub-layers to contextualization, we first extract the sub-layer representations of polysemous words in minimally different sentence pairs.
We also try to empirically localize the strength of contextualization information encoded in these sub-layer representations.
arXiv Detail & Related papers (2024-09-21T10:42:07Z) - Searching for Discriminative Words in Multidimensional Continuous Feature Space [0.0]
We propose a novel method to extract discriminative keywords from documents.
We show how different discriminative metrics influence the overall results.
We conclude that word feature vectors can substantially improve the topical inference of documents' meaning.
arXiv Detail & Related papers (2022-11-26T18:05:11Z) - Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show the proposed neighboring distribution divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts (a rough sketch of the mask-and-predict idea appears after this list).
arXiv Detail & Related papers (2021-10-04T03:59:15Z) - UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
arXiv Detail & Related papers (2021-05-28T19:44:24Z) - Out of Order: How important is the sequential order of words in a sentence in Natural Language Understanding tasks? [34.18339528128342]
We find that state-of-the-art natural language understanding models are largely insensitive to word order when making predictions.
BERT-based models exploit superficial cues to make correct decisions when tokens are arranged in random orders.
Our work suggests that many GLUE tasks are not challenging machines to understand the meaning of a sentence.
arXiv Detail & Related papers (2020-12-30T14:56:12Z) - Narrative Incoherence Detection [76.43894977558811]
We propose the task of narrative incoherence detection as a new arena for inter-sentential semantic understanding.
Given a multi-sentence narrative, the task is to decide whether there are any semantic discrepancies in the narrative flow.
arXiv Detail & Related papers (2020-12-21T07:18:08Z) - MASKER: Masked Keyword Regularization for Reliable Text Classification [73.90326322794803]
We propose a fine-tuning method, coined masked keyword regularization (MASKER), that facilitates context-based prediction.
MASKER regularizes the model to reconstruct the keywords from the rest of the words and make low-confidence predictions without enough context.
We demonstrate that MASKER improves OOD detection and cross-domain generalization without degrading classification accuracy.
arXiv Detail & Related papers (2020-12-17T04:54:16Z) - R$^2$-Net: Relation of Relation Learning Network for Sentence Semantic Matching [58.72111690643359]
We propose a Relation of Relation Learning Network (R2-Net) for sentence semantic matching.
We first employ BERT to encode the input sentences from a global perspective.
Then a CNN-based encoder is designed to capture keywords and phrase information from a local perspective.
To fully leverage labels for better relation information extraction, we introduce a self-supervised relation of relation classification task.
arXiv Detail & Related papers (2020-12-16T13:11:30Z) - Subjective Question Answering: Deciphering the inner workings of Transformers in the realm of subjectivity [0.0]
I've exploited a recently released dataset for span-selection Question Answering, namely SubjQA.
SubjQA is the first dataset that contains questions that ask for subjective opinions corresponding to review paragraphs from six different domains.
I've investigated the inner workings of a Transformer-based architecture to contribute to a better understanding of these not yet well understood "black-box" models.
arXiv Detail & Related papers (2020-06-02T13:48:14Z) - Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [87.33156149634392]
We critically examine RefCOCOg, a standard benchmark for visual referring expression recognition.
We show that 83.7% of test instances do not require reasoning on linguistic structure.
We propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT.
arXiv Detail & Related papers (2020-05-04T17:09:15Z)
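As a side note on the mask-and-predict strategy mentioned in the "Contextualized Semantic Distance between Highly Overlapped Texts" entry above, here is a rough sketch of the general idea: mask an overlapping word in each of two paired texts, read off the masked-LM distribution at that position, and compare the two distributions with a divergence. The model choice, alignment, and symmetric-KL comparison below are illustrative stand-ins, not the paper's exact NDD definition.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def masked_distribution(words, position):
    """MLM distribution over the vocabulary at `position` when that word is masked."""
    masked = list(words)
    masked[position] = tok.mask_token
    enc = tok(" ".join(masked), return_tensors="pt")
    mask_idx = (enc.input_ids[0] == tok.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = mlm(**enc).logits[0, mask_idx]
    return torch.softmax(logits, dim=-1)

def masked_divergence(words_a, pos_a, words_b, pos_b):
    """Symmetric KL between the distributions predicted at two aligned positions."""
    p = masked_distribution(words_a, pos_a)
    q = masked_distribution(words_b, pos_b)
    kl = lambda x, y: torch.sum(x * (torch.log(x + 1e-12) - torch.log(y + 1e-12)))
    return 0.5 * (kl(p, q) + kl(q, p)).item()

a = "the movie was surprisingly good".split()
b = "the movie was surprisingly bad".split()
# "movie" (position 1) is an overlapping word in the two texts; a larger divergence
# suggests the differing context changes what the model expects at that position.
print(masked_divergence(a, 1, b, 1))
```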
This list is automatically generated from the titles and abstracts of the papers in this site.