Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models
- URL: http://arxiv.org/abs/2509.24510v2
- Date: Mon, 13 Oct 2025 08:29:09 GMT
- Title: Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models
- Authors: Jonas Hübotter, Patrik Wolf, Alexander Shevchenko, Dennis Jüni, Andreas Krause, Gil Kur
- Abstract summary: Recent studies have explored the idea of continuing to train a model at test-time for a given task, known as test-time training (TTT). We propose a model in which TTT achieves a substantially smaller in-distribution test error than global training. We empirically validate our model's key assumptions by training a sparse autoencoder on ImageNet.
- Score: 64.02612380298228
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent empirical studies have explored the idea of continuing to train a model at test-time for a given task, known as test-time training (TTT), and have found it to yield significant performance improvements. However, there is limited understanding of why and when TTT is effective. Earlier explanations mostly focused on the observation that TTT may help when applied to out-of-distribution adaptation or used with privileged data. However, the growing scale of foundation models, with most test data being in-distribution, calls these explanations into question. We instead posit that foundation models remain globally underparameterized, with TTT providing a mechanism for specialization after generalization, focusing capacity on concepts relevant to the test task. Specifically, under the linear representation hypothesis, we propose a model in which TTT achieves a substantially smaller in-distribution test error than global training. We empirically validate our model's key assumptions by training a sparse autoencoder on ImageNet, showing that semantically related data points are explained by only a few shared concepts. Finally, we perform scaling studies across image and language tasks that confirm the practical implications of our model, identifying the regimes where specialization is most effective.
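The following is a minimal, hypothetical sketch of test-time training as "specialization after generalization": a globally pretrained classifier is briefly fine-tuned on the training points nearest to each test input before predicting on it. All names (`retrieve_neighbors`, `test_time_train`, the SGD settings) are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of test-time training (TTT) as local specialization.
# Assumes `model` is a pretrained torch.nn.Module classifier over flat features,
# and (train_x, train_y) is the training set it was globally trained on.
import copy

import torch
import torch.nn.functional as F


def retrieve_neighbors(x_test, train_x, train_y, k=32):
    """Select the k training points closest to x_test (Euclidean distance)."""
    dists = torch.cdist(x_test.unsqueeze(0), train_x).squeeze(0)  # (N,)
    idx = dists.topk(k, largest=False).indices
    return train_x[idx], train_y[idx]


def test_time_train(model, x_test, train_x, train_y, steps=10, lr=1e-3):
    """Fine-tune a copy of the globally trained model on local neighbors, then predict."""
    local_model = copy.deepcopy(model)  # keep the global model intact
    opt = torch.optim.SGD(local_model.parameters(), lr=lr)
    nb_x, nb_y = retrieve_neighbors(x_test, train_x, train_y)

    local_model.train()
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(local_model(nb_x), nb_y)
        loss.backward()
        opt.step()

    local_model.eval()
    with torch.no_grad():
        return local_model(x_test.unsqueeze(0)).argmax(dim=-1)
```

Concentrating the update on a local neighborhood is the point of the sketch: the specialized copy spends its limited capacity on the few concepts relevant to the test input, while the global model stays unchanged for the next query.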
Related papers
- NeuroTTT: Bridging Pretraining-Downstream Task Misalignment in EEG Foundation Models via Test-Time Training [6.030518150035875]
This paper introduces a two-stage alignment strategy for EEG foundation models. First, we propose NeuroTTT, a domain-specific self-supervised fine-tuning paradigm. Second, we perform self-supervised test-time training on individual unlabeled test samples. Our approach is the first to unify domain-tuned self-supervision with test-time training in large-scale EEG foundation models.
arXiv Detail & Related papers (2025-09-30T14:14:46Z) - Test-Time Training Provably Improves Transformers as In-context Learners [49.09821664572445]
We investigate a gradient-based TTT algorithm for in-context learning. We train a transformer model on the in-context demonstrations provided in the test prompt. As our empirical contribution, we study the benefits of TTT for TabPFN.
arXiv Detail & Related papers (2025-03-14T20:06:37Z) - BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping [64.8477128397529]
We propose a test-time adaptation framework that bridges the training-required and training-free paradigms.
We maintain a lightweight key-value memory for feature retrieval from instance-agnostic historical samples and instance-aware boosting samples (a hedged sketch of such a key-value memory appears after this list).
We theoretically justify our method and empirically verify its effectiveness on both out-of-distribution and cross-domain datasets.
arXiv Detail & Related papers (2024-10-20T15:58:43Z) - IT$^3$: Idempotent Test-Time Training [95.78053599609044]
Deep learning models often struggle when deployed in real-world settings due to distribution shifts between training and test data. We present Idempotent Test-Time Training (IT$^3$), a novel approach that enables on-the-fly adaptation to distribution shifts using only the current test instance. Our results suggest that idempotence provides a universal principle for test-time adaptation that generalizes across domains and architectures.
arXiv Detail & Related papers (2024-10-05T15:39:51Z) - Test-Time Training on Graphs with Large Language Models (LLMs) [68.375487369596]
Test-Time Training (TTT) has been proposed as a promising approach to train Graph Neural Networks (GNNs).
Inspired by the great annotation ability of Large Language Models (LLMs) on Text-Attributed Graphs (TAGs), we propose to enhance the test-time training on graphs with LLMs as annotators.
A two-stage training strategy is designed to tailor the test-time model with the limited and noisy labels.
arXiv Detail & Related papers (2024-04-21T08:20:02Z) - TEA: Test-time Energy Adaptation [67.4574269851666]
Test-time adaptation (TTA) aims to improve model generalizability when test data diverges from the training distribution.
We propose a novel energy-based perspective, enhancing the model's perception of target data distributions.
arXiv Detail & Related papers (2023-11-24T10:49:49Z)
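As referenced in the BoostAdapter entry above, a lightweight key-value memory can support training-free test-time adaptation. The sketch below is a hedged illustration under assumed names (`KeyValueMemory`, `adapt_logits`) and an assumed blending rule, not the paper's method: it caches normalized features as keys and pseudo-labels as values, then blends a similarity-weighted readout of the cache with the frozen model's logits, without any gradient updates.

```python
# Hypothetical light-weight key-value memory for training-free test-time adaptation.
# Keys: normalized features; values: one-hot pseudo-labels (e.g., logits.argmax()).
import torch
import torch.nn.functional as F


class KeyValueMemory:
    def __init__(self, num_classes, capacity=512):
        self.keys, self.values = [], []
        self.num_classes = num_classes
        self.capacity = capacity

    def add(self, feature, pseudo_label):
        """Store one (feature, pseudo-label) pair; pseudo_label is a scalar LongTensor."""
        if len(self.keys) >= self.capacity:  # drop the oldest entry when full
            self.keys.pop(0)
            self.values.pop(0)
        self.keys.append(F.normalize(feature, dim=-1))
        self.values.append(F.one_hot(pseudo_label, self.num_classes).float())

    def readout(self, feature, beta=5.0):
        """Similarity-weighted vote over cached pseudo-labels."""
        if not self.keys:
            return torch.zeros(self.num_classes)
        K = torch.stack(self.keys)              # (n, d)
        V = torch.stack(self.values)            # (n, C)
        sims = F.normalize(feature, dim=-1) @ K.T   # (n,) cosine similarities
        return torch.exp(beta * (sims - 1.0)) @ V   # (C,)


def adapt_logits(model_logits, feature, memory, alpha=1.0):
    """Blend the frozen model's logits with the cache readout (no gradient updates)."""
    return model_logits + alpha * memory.readout(feature)
```

Because adaptation happens purely through retrieval and blending, the base model stays frozen; the cache itself is the only state that changes as the test stream arrives.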
This list is automatically generated from the titles and abstracts of the papers on this site.