What Context Features Can Transformer Language Models Use?
- URL: http://arxiv.org/abs/2106.08367v1
- Date: Tue, 15 Jun 2021 18:38:57 GMT
- Title: What Context Features Can Transformer Language Models Use?
- Authors: Joe O'Connor and Jacob Andreas
- Abstract summary: We measure usable information by selectively ablating lexical and structural information in transformer language models trained on English Wikipedia.
In both mid- and long-range contexts, we find that several extremely destructive context manipulations remove less than 15% of the usable information.
- Score: 32.49689188570872
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based language models benefit from conditioning on contexts of
hundreds to thousands of previous tokens. What aspects of these contexts
contribute to accurate model prediction? We describe a series of experiments
that measure usable information by selectively ablating lexical and structural
information in transformer language models trained on English Wikipedia. In
both mid- and long-range contexts, we find that several extremely destructive
context manipulations -- including shuffling word order within sentences and
deleting all words other than nouns -- remove less than 15% of the usable
information. Our results suggest that long contexts, but not their detailed
syntactic and propositional content, are important for the low perplexity of
current transformer language models.
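As an illustration (not the authors' released code), the two context manipulations named in the abstract can be sketched in a few lines; the use of NLTK for sentence splitting and POS tagging, and the function names, are assumptions:

```python
# Illustrative sketch of two of the abstract's context ablations:
# shuffling word order within each sentence, and deleting all non-nouns.
# NLTK is an assumed tool choice, not the authors' implementation.
import random
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)


def shuffle_within_sentences(text: str, seed: int = 0) -> str:
    """Shuffle word order independently inside each sentence."""
    rng = random.Random(seed)
    shuffled = []
    for sent in nltk.sent_tokenize(text):
        words = nltk.word_tokenize(sent)
        rng.shuffle(words)
        shuffled.append(" ".join(words))
    return " ".join(shuffled)


def keep_only_nouns(text: str) -> str:
    """Delete every word whose Penn Treebank tag is not a noun (NN*)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return " ".join(word for word, tag in tagged if tag.startswith("NN"))


context = "The cat sat on the mat. It watched the birds outside."
print(shuffle_within_sentences(context))  # words permuted within each sentence
print(keep_only_nouns(context))           # "cat mat birds"
```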
Related papers
- Tracking linguistic information in transformer-based sentence embeddings through targeted sparsification [1.6021932740447968]
Analyses of transformer-based models have shown that they encode a variety of linguistic information from their textual input.
We test to what degree information about chunks (in particular noun, verb or prepositional phrases) can be localized in sentence embeddings.
Our results show that such information is not distributed over the entire sentence embedding, but rather it is encoded in specific regions.
arXiv Detail & Related papers (2024-07-25T15:27:08Z)
- Conditional Language Learning with Context [19.708303468664088]
We propose a simple modification to causal language modeling called conditional finetuning.
We show that a context can "explain away" certain corpus statistics and make the model avoid learning them.
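A hedged reading of how such conditioning could be realized (an assumption here, not the paper's verified recipe) is to prepend the context tokens and exclude them from the loss, so the model is trained only on the continuation given the context:

```python
# Hedged sketch: finetune a causal LM on text prepended with a context,
# with the loss computed only on the text (context labels set to -100).
# The model choice and conditioning string are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "Domain: encyclopedia article.\n"            # hypothetical context
text = "The quick brown fox jumps over the lazy dog."

ctx_ids = tokenizer(context, return_tensors="pt").input_ids
txt_ids = tokenizer(text, return_tensors="pt").input_ids

input_ids = torch.cat([ctx_ids, txt_ids], dim=1)
labels = input_ids.clone()
labels[:, : ctx_ids.shape[1]] = -100                   # ignore context positions

loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()  # an optimizer step would follow in an actual finetuning loop
```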
arXiv Detail & Related papers (2024-06-04T05:22:24Z)
- Explaining How Transformers Use Context to Build Predictions [0.1749935196721634]
Language Generation Models produce words based on the previous context.
It is still unclear how prior words affect the model's decision throughout the layers.
We leverage recent advances in explainability of the Transformer and present a procedure to analyze models for language generation.
arXiv Detail & Related papers (2023-05-21T18:29:10Z)
- HanoiT: Enhancing Context-aware Translation via Selective Context [95.93730812799798]
Context-aware neural machine translation aims to use the document-level context to improve translation quality.
Irrelevant or trivial words in that context can introduce noise and distract the model from learning the relationship between the current sentence and the auxiliary context.
We propose a novel end-to-end encoder-decoder model with a layer-wise selection mechanism to sift and refine the long document context.
arXiv Detail & Related papers (2023-01-17T12:07:13Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not represent natural language semantics well.
arXiv Detail & Related papers (2022-10-14T02:35:19Z)
- Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers [54.4919139401528]
We show that it is possible to reduce interference by identifying and pruning language-specific parameters.
We show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structural prediction.
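A minimal sketch of the removal step only, using Hugging Face's prune_heads utility; the layer and head indices below are hypothetical placeholders, and the Shapley-value attribution that identifies them in the paper is not shown:

```python
# Hedged sketch: prune specific attention heads from a fixed multilingual
# encoder. The layer -> head indices are hypothetical; the paper identifies
# language-specific heads via Shapley values, which is not reproduced here.
from transformers import AutoModel

model = AutoModel.from_pretrained("xlm-roberta-base")

heads_to_prune = {0: [3, 7], 5: [1], 11: [0, 9]}  # hypothetical interfering heads
model.prune_heads(heads_to_prune)

# The pruned encoder can then be evaluated on the target-language task as usual.
```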
arXiv Detail & Related papers (2022-10-11T18:11:37Z)
- Do Context-Aware Translation Models Pay the Right Attention? [61.25804242929533]
Context-aware machine translation models are designed to leverage contextual information, but often fail to do so.
In this paper, we ask several questions: What contexts do human translators use to resolve ambiguous words?
We introduce SCAT (Supporting Context for Ambiguous Translations), a new English-French dataset comprising supporting context words for 14K translations.
Using SCAT, we perform an in-depth analysis of the context used to disambiguate, examining positional and lexical characteristics of the supporting words.
arXiv Detail & Related papers (2021-05-14T17:32:24Z)
- Investigating representations of verb bias in neural language models [7.455546102930909]
We introduce DAIS, a benchmark dataset containing 50K human judgments for 5K distinct sentence pairs in the English dative alternation.
This dataset includes 200 unique verbs and systematically varies the definiteness and length of arguments.
We use this dataset, as well as an existing corpus of naturally occurring data, to evaluate how well recent neural language models capture human preferences.
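A minimal sketch of the kind of preference check this implies: compare a causal LM's log-probability for the two variants of a dative pair. The example sentences and the choice of GPT-2 are assumptions, not items drawn from DAIS:

```python
# Hedged sketch: compare a causal LM's log-probability for the two variants
# of a dative alternation pair. The sentences and GPT-2 are illustrative
# assumptions, not items from the DAIS dataset.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()


def sentence_logprob(sentence: str) -> float:
    """Total log-probability of a sentence under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(input_ids=ids, labels=ids).loss  # mean NLL over predicted tokens
    return -loss.item() * (ids.shape[1] - 1)


double_object = "The teacher gave the student a book."
prepositional = "The teacher gave a book to the student."
print("prefers double object:",
      sentence_logprob(double_object) > sentence_logprob(prepositional))
```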
arXiv Detail & Related papers (2020-10-05T22:39:08Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.