Transformers perform adaptive partial pooling
- URL: http://arxiv.org/abs/2602.03980v1
- Date: Tue, 03 Feb 2026 20:05:01 GMT
- Title: Transformers perform adaptive partial pooling
- Authors: Vsevolod Kapatsinski
- Abstract summary: In hierarchical regression, the model's predictions for behavior in a context are affected by observations from other similar contexts. This is called adaptive partial pooling of evidence. This paper shows that next-word predictions of a transformer (GPT2) are increasingly unaffected by observations from outside the current context.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Because language is creative, any reasonable language model must generalize, deciding what to say in novel contexts by using information from similar contexts. But what about contexts that are not novel but merely infrequent? In hierarchical regression, the model's predictions for behavior in a context are affected by observations from other similar contexts to the extent that 1) the current context is infrequent and 2) different contexts behave similarly. This is called adaptive partial pooling of evidence. This paper shows that next-word predictions of a transformer (GPT2) are increasingly unaffected by observations from outside the current context across epochs of training (the amount of pooling reduces with training), and that the extent of pooling is affected by context frequency, context number (type frequency) and context variability in a similar way to hierarchical regression. These characteristics of learning in transformers are argued to be realistic on both rational and empirical grounds.
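The pooling behavior the abstract describes can be illustrated with a minimal sketch of shrinkage in a normal-normal hierarchical model. This is not the paper's method; the function, variance terms, and numbers below are illustrative assumptions chosen only to show how an infrequent context is pulled toward the pooled estimate while a frequent one is not.

```python
# A minimal sketch of adaptive partial pooling (shrinkage), assuming a
# simple normal-normal hierarchical model. All names and values here are
# hypothetical illustrations, not taken from the paper.
def partial_pool(context_mean, n_obs, grand_mean, between_var, within_var):
    """Shrink a context's observed mean toward the grand mean.

    The shrinkage toward the grand mean grows when the context is
    infrequent (small n_obs) or when contexts behave similarly
    (small between_var), matching the two conditions in the abstract.
    """
    precision_context = n_obs / within_var   # evidence from this context
    precision_pool = 1.0 / between_var       # evidence pooled across contexts
    w = precision_context / (precision_context + precision_pool)
    return w * context_mean + (1 - w) * grand_mean

# An infrequent context (2 observations) is pulled strongly toward the
# grand mean of 0.0 ...
rare = partial_pool(context_mean=2.0, n_obs=2, grand_mean=0.0,
                    between_var=1.0, within_var=4.0)
# ... while a frequent context (200 observations) mostly keeps its own mean.
frequent = partial_pool(context_mean=2.0, n_obs=200, grand_mean=0.0,
                        between_var=1.0, within_var=4.0)
```

Under these assumed variances, the rare context's estimate lies much closer to the grand mean than the frequent context's does, which is the sense in which pooling is "adaptive."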
Related papers
- When Does Context Help? Error Dynamics of Contextual Information in Large Language Models [64.88201012057822]
We present a unified theoretical framework for analyzing the effect of arbitrary contextual information in large language models. Our analysis characterizes contextual influence through output error dynamics. Experiments across ICL, retrieval-augmented generation, and memory evolution validate our theory and motivate a principled context selection strategy.
arXiv Detail & Related papers (2026-02-09T05:58:41Z)
- Counterfactual reasoning: an analysis of in-context emergence [57.118735341305786]
We show that language models are capable of counterfactual reasoning. We find that self-attention, model depth and pre-training data diversity drive performance. Our findings extend to counterfactual reasoning under SDE dynamics.
arXiv Detail & Related papers (2025-06-05T16:02:07Z)
- Toward Understanding In-context vs. In-weight Learning [50.24035812301655]
We identify simplified distributional properties that give rise to the emergence and disappearance of in-context learning. We then extend the study to a full large language model, showing how fine-tuning on various collections of natural language prompts can elicit similar in-context and in-weight learning behaviour.
arXiv Detail & Related papers (2024-10-30T14:09:00Z)
- On the Role of Context in Reading Time Prediction [50.87306355705826]
We present a new perspective on how readers integrate context during real-time language comprehension. Our proposals build on surprisal theory, which posits that the processing effort of a linguistic unit is an affine function of its in-context information content.
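Surprisal theory's linking function, as summarized above, can be sketched in a few lines. Only the affine form comes from the abstract; the slope, intercept, and probabilities below are hypothetical values chosen for illustration.

```python
import math

def surprisal(p):
    """Information content of a word: -log2 of its in-context probability."""
    return -math.log2(p)

def predicted_effort(p, slope=30.0, intercept=200.0):
    """Processing effort (hypothetical ms units) as an affine function of
    surprisal, i.e. effort = slope * surprisal + intercept."""
    return slope * surprisal(p) + intercept

# A predictable word incurs less predicted effort than a surprising one.
easy = predicted_effort(0.5)    # surprisal = 1 bit
hard = predicted_effort(0.001)  # surprisal ~ 10 bits
```

The affine form matters because it predicts that each additional bit of surprisal adds a constant amount of processing effort, regardless of how predictable the word already was.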
arXiv Detail & Related papers (2024-09-12T15:52:22Z)
- Class Is Invariant to Context and Vice Versa: On Learning Invariance for Out-Of-Distribution Generalization [85.21263480129056]
We question the widely adopted assumption in prior work that the context bias can be directly annotated or estimated from biased class prediction. In contrast, we point out the ever-overlooked other side of the above principle: context is also invariant to class. We implement this idea by minimizing the contrastive loss of intra-class sample similarity while assuring this similarity to be invariant across all classes.
arXiv Detail & Related papers (2022-08-06T08:09:54Z)
- Mixed-effects transformers for hierarchical adaptation [1.9105318290910576]
We introduce the mixed-effects transformer (MET), a novel approach for learning hierarchically-structured prefixes.
We show how the popular class of mixed-effects models may be extended to transformer-based architectures.
arXiv Detail & Related papers (2022-05-03T19:34:15Z)
- Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing [18.91129968022831]
Interpretability methods need to be robust for trustworthy NLP applications in high-stake areas like medicine or finance.
Our paper demonstrates how interpretations can be manipulated by making simple word perturbations on an input text.
arXiv Detail & Related papers (2021-08-11T02:07:21Z)
- What Context Features Can Transformer Language Models Use? [32.49689188570872]
We measure usable information by selectively ablating lexical and structural information in transformer language models trained on English Wikipedia.
In both mid- and long-range contexts, we find that several extremely destructive context manipulations remove less than 15% of the usable information.
arXiv Detail & Related papers (2021-06-15T18:38:57Z)
- Counterfactual Invariance to Spurious Correlations: Why and How to Pass Stress Tests [87.60900567941428]
A 'spurious correlation' is the dependence of a model on some aspect of the input data that an analyst thinks shouldn't matter.
In machine learning, these have a know-it-when-you-see-it character.
We study stress testing using the tools of causal inference.
arXiv Detail & Related papers (2021-05-31T14:39:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.