Related papers: Universal priors: solving empirical Bayes via Bayesian inference and pretraining

URL: http://arxiv.org/abs/2602.15136v1
Date: Mon, 16 Feb 2026 19:29:27 GMT
Title: Universal priors: solving empirical Bayes via Bayesian inference and pretraining
Authors: Nick Cannella, Anzo Teh, Yanjun Han, Yury Polyanskiy
Abstract summary: A transformer pretrained on synthetically generated data achieves strong performance on empirical Bayes (EB) problems. We ask why a pretrained Bayes estimator, trained under a prespecified training distribution, can adapt to arbitrary test distributions.
Score: 25.835876583903282
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We theoretically justify the recent empirical finding of [Teh et al., 2025] that a transformer pretrained on synthetically generated data achieves strong performance on empirical Bayes (EB) problems. We take an indirect approach to this question: rather than analyzing the model architecture or training dynamics, we ask why a pretrained Bayes estimator, trained under a prespecified training distribution, can adapt to arbitrary test distributions. Focusing on Poisson EB problems, we identify the existence of universal priors such that training under these priors yields a near-optimal regret bound of $\widetilde{O}(\frac{1}{n})$ uniformly over all test distributions. Our analysis leverages the classical phenomenon of posterior contraction in Bayesian statistics, showing that the pretrained transformer adapts to unknown test distributions precisely through posterior contraction. This perspective also explains the phenomenon of length generalization, in which the test sequence length exceeds the training length, as the model performs Bayesian inference using a generalized posterior.
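To make the Poisson EB setting in the abstract concrete, the following is a minimal illustrative sketch, not the paper's method. It assumes a Gamma prior on the Poisson means (chosen only because its posterior mean has a closed form) and compares the classical Robbins estimator, used here as a stand-in EB baseline, against the oracle Bayes estimator that knows the prior. The gap between the two mean squared errors is one common notion of regret. All function names and parameter values are hypothetical.

```python
import math
import random
from collections import Counter


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) variate with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def robbins_estimate(x, counts):
    """Robbins' estimator (x + 1) * N(x + 1) / N(x), with N(k) the number of samples equal to k."""
    return (x + 1) * counts.get(x + 1, 0) / counts[x]


def gamma_bayes_estimate(x, shape, rate):
    """Posterior mean of theta under an assumed Gamma(shape, rate) prior and Poisson likelihood."""
    return (shape + x) / (rate + 1.0)


def run_demo(n=5000, shape=2.0, rate=1.0, seed=0):
    """Return (EB mean squared error, oracle Bayes mean squared error) on simulated data."""
    rng = random.Random(seed)
    # random.gammavariate takes a scale parameter, so pass 1 / rate.
    thetas = [rng.gammavariate(shape, 1.0 / rate) for _ in range(n)]
    xs = [sample_poisson(theta, rng) for theta in thetas]
    counts = Counter(xs)
    mse_eb = sum((robbins_estimate(x, counts) - t) ** 2 for x, t in zip(xs, thetas)) / n
    mse_bayes = sum((gamma_bayes_estimate(x, shape, rate) - t) ** 2 for x, t in zip(xs, thetas)) / n
    return mse_eb, mse_bayes


if __name__ == "__main__":
    mse_eb, mse_bayes = run_demo()
    print("empirical regret (EB MSE minus oracle MSE):", mse_eb - mse_bayes)
```

The Robbins estimator uses only the observed counts and never sees the prior, which is the sense in which an EB procedure must adapt to an unknown test distribution. The simulated regret is a noisy estimate and depends on the seed and on the assumed prior.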

Related papers

Calibrated Test-Time Guidance for Bayesian Inference [25.653139110512914]
We show that common test-time guidance methods do not recover the correct posterior distribution and identify the structural approximations responsible for this failure. We then propose consistent alternative estimators that enable sampling from the Bayesian posterior. We significantly outperform previous methods on a set of Bayesian inference tasks, and match state-of-the-art in black hole image reconstruction.
arXiv Detail & Related papers (2026-02-25T21:38:47Z)
Analyzing Generalization in Pre-Trained Symbolic Regression [17.789199791229624]
Symbolic regression algorithms search a space of mathematical expressions for formulas that explain given data. Transformer-based models have emerged as a promising approach, shifting the expensive search to a large-scale pre-training phase.
arXiv Detail & Related papers (2025-09-24T07:47:02Z)
BAPE: Learning an Explicit Bayes Classifier for Long-tailed Visual Recognition [78.70453964041718]
Current deep learning algorithms usually solve for the optimal classifier by implicitly estimating the posterior probabilities. This simple methodology has been proven effective for meticulously balanced academic benchmark datasets. However, it is not applicable to the long-tailed data distributions in the real world. This paper presents a novel approach (BAPE) that provides a more precise theoretical estimation of the data distributions.
arXiv Detail & Related papers (2025-06-29T15:12:50Z)
How Transformers Learn In-Context Recall Tasks? Optimality, Training Dynamics and Generalization [23.759737527800585]
We study the approximation capabilities, convergence speeds and on-convergence behaviors of transformers trained on in-context recall tasks. We show that the trained transformers exhibit out-of-distribution generalization, i.e., generalizing to samples outside of the population distribution.
arXiv Detail & Related papers (2025-05-21T01:26:44Z)
A Classical View on Benign Overfitting: The Role of Sample Size [14.36840959836957]
We focus on almost benign overfitting, where models simultaneously achieve both arbitrarily small training and test errors. This behavior is characteristic of neural networks, which often achieve low (but non-zero) training error while still generalizing well.
arXiv Detail & Related papers (2025-05-16T18:37:51Z)
Generalizing to any diverse distribution: uniformity, gentle finetuning and rebalancing [55.791818510796645]
We aim to develop models that generalize well to any diverse test distribution, even if the latter deviates significantly from the training data. Various approaches like domain adaptation, domain generalization, and robust optimization attempt to address the out-of-distribution challenge. We adopt a more conservative perspective by accounting for the worst-case error across all sufficiently diverse test distributions within a known domain.
arXiv Detail & Related papers (2024-10-08T12:26:48Z)
Variational Prediction [95.00085314353436]
We present a technique for learning a variational approximation to the posterior predictive distribution using a variational bound. This approach can provide good predictive distributions without test time marginalization costs.
arXiv Detail & Related papers (2023-07-14T18:19:31Z)
Distribution Shift Inversion for Out-of-Distribution Prediction [57.22301285120695]
We propose a portable Distribution Shift Inversion algorithm for Out-of-Distribution (OoD) prediction. We show that our method provides a general performance gain when plugged into a wide range of commonly used OoD algorithms.
arXiv Detail & Related papers (2023-06-14T08:00:49Z)
Semantic Self-adaptation: Enhancing Generalization with a Single Sample [45.111358665370524]
We propose a self-adaptive approach for semantic segmentation. It fine-tunes the parameters of convolutional layers to the input image using consistency regularization. Our empirical study suggests that self-adaptation may complement the established practice of model regularization at training time.
arXiv Detail & Related papers (2022-08-10T12:29:01Z)
Sample-Efficient Optimisation with Probabilistic Transformer Surrogates [66.98962321504085]
This paper investigates the feasibility of employing state-of-the-art probabilistic transformers in Bayesian optimisation. We observe two drawbacks stemming from their training procedure and loss definition, hindering their direct deployment as proxies in black-box optimisation. We introduce two components: 1) a BO-tailored training prior supporting non-uniformly distributed points, and 2) a novel approximate posterior regulariser trading-off accuracy and input sensitivity to filter favourable stationary points for improved predictive performance.
arXiv Detail & Related papers (2022-05-27T11:13:17Z)
Posterior concentration and fast convergence rates for generalized Bayesian learning [4.186575888568896]
We study the learning rate of generalized Bayes estimators in a general setting. We prove that under the multi-scale Bernstein's condition, the generalized posterior distribution concentrates around the set of optimal hypotheses.
arXiv Detail & Related papers (2021-11-19T14:25:21Z)
Convergence Rates of Empirical Bayes Posterior Distributions: A Variational Perspective [20.51199643121034]
We study the convergence rates of empirical Bayes posterior distributions for nonparametric and high-dimensional inference. We show that the empirical Bayes posterior distribution induced by the maximum marginal likelihood estimator can be regarded as a variational approximation to a hierarchical Bayes posterior distribution.
arXiv Detail & Related papers (2020-09-08T19:35:27Z)
Balance-Subsampled Stable Prediction [55.13512328954456]
We propose a novel balance-subsampled stable prediction (BSSP) algorithm based on the theory of fractional factorial design. A design-theoretic analysis shows that the proposed method can reduce the confounding effects among predictors induced by the distribution shift. Numerical experiments on both synthetic and real-world data sets demonstrate that our BSSP algorithm significantly outperforms the baseline methods for stable prediction across unknown test data.
arXiv Detail & Related papers (2020-06-08T07:01:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.