How Much Data Is Enough? Uniform Convergence Bounds for Generative & Vision-Language Models under Low-Dimensional Structure
- URL: http://arxiv.org/abs/2512.23109v1
- Date: Sun, 28 Dec 2025 23:16:22 GMT
- Title: How Much Data Is Enough? Uniform Convergence Bounds for Generative & Vision-Language Models under Low-Dimensional Structure
- Authors: Paul M. Thompson,
- Abstract summary: Modern generative and vision-language models (VLMs) are increasingly used in scientific and medical decision support. Despite strong empirical results with moderate data, it remains unclear when such predictions generalize uniformly across inputs, classes, or subpopulations. We study this question from a finite-sample perspective and ask: under what structural assumptions can generative and VLM-based predictors achieve uniformly accurate and calibrated behavior with practical sample sizes?
- Score: 1.560394526607184
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern generative and vision-language models (VLMs) are increasingly used in scientific and medical decision support, where predicted probabilities must be both accurate and well calibrated. Despite strong empirical results with moderate data, it remains unclear when such predictions generalize uniformly across inputs, classes, or subpopulations, rather than only on average, a critical issue in biomedicine, where rare conditions and specific groups can exhibit large errors even when overall loss is low. We study this question from a finite-sample perspective and ask: under what structural assumptions can generative and VLM-based predictors achieve uniformly accurate and calibrated behavior with practical sample sizes? Rather than analyzing arbitrary parameterizations, we focus on induced families of classifiers obtained by varying prompts or semantic embeddings within a restricted representation space. When model outputs depend smoothly on a low-dimensional semantic representation, an assumption supported by spectral structure in text and joint image-text embeddings, classical uniform convergence tools yield meaningful non-asymptotic guarantees. Our main results give finite-sample uniform convergence bounds for accuracy and calibration functionals of VLM-induced classifiers under Lipschitz stability with respect to prompt embeddings. The implied sample complexity depends on intrinsic/effective dimension, not ambient embedding dimension, and we further derive spectrum-dependent bounds that make explicit how eigenvalue decay governs data requirements. We conclude with implications for data-limited biomedical modeling, including when current dataset sizes can support uniformly reliable predictions and why average calibration metrics may miss worst-case miscalibration.
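The abstract does not reproduce the paper's theorems, but the classical covering-number mechanism behind bounds of this type can be sketched. The notation below is illustrative, not the paper's: assume the prompt embedding $\theta$ ranges over a radius-$R$ ball $\Theta$ of effective dimension $d$, and the relevant loss is bounded in $[0,1]$ and $L$-Lipschitz in $\theta$.

```latex
% Sketch of a standard covering-number bound; the symbols (L, R, d, Theta)
% are illustrative assumptions, not the paper's notation.
\[
  \mathcal{N}(\Theta, \|\cdot\|, \epsilon) \le \Bigl(\tfrac{3R}{\epsilon}\Bigr)^{d}
  \quad\Longrightarrow\quad
  \sup_{\theta \in \Theta}
  \bigl| \widehat{\mathrm{err}}_n(f_\theta) - \mathrm{err}(f_\theta) \bigr|
  \lesssim
  \sqrt{\frac{d \log(LRn) + \log(1/\delta)}{n}}
\]
% with probability at least 1 - delta: the sample size needed for a uniform
% deviation scales with the effective dimension d, not the ambient
% embedding dimension.
```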
Related papers
- SIGMA: Scalable Spectral Insights for LLM Collapse [51.863164847253366]
We introduce SIGMA (Spectral Inequalities for Gram Matrix Analysis), a unified framework for analyzing model collapse. By utilizing benchmarks and deriving deterministic bounds on the Gram matrix's spectrum, SIGMA provides a mathematically grounded metric to track the contraction of the representation space. We demonstrate that SIGMA effectively captures the transition towards collapsed states, offering theoretical insights into the mechanics of collapse.
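SIGMA's exact inequalities are not given in this summary; a minimal spectral-contraction diagnostic in the same spirit, the effective rank of the Gram matrix of representations, might look like the following sketch.

```python
import numpy as np

def effective_rank(reps: np.ndarray) -> float:
    """Effective rank (exp of spectral entropy) of the Gram matrix of
    representations; a shrinking value signals contraction of the
    representation space. Illustrative diagnostic, not SIGMA's metric."""
    X = reps - reps.mean(axis=0)          # center the representations
    gram = X @ X.T                        # Gram matrix (n x n)
    eig = np.linalg.eigvalsh(gram)        # symmetric -> real eigenvalues
    eig = np.clip(eig, 0.0, None)
    p = eig / eig.sum()                   # normalized spectrum
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

# Toy usage: a collapsed (rank-1) representation has far lower effective rank.
rng = np.random.default_rng(0)
healthy = rng.normal(size=(200, 64))
collapsed = np.outer(rng.normal(size=200), rng.normal(size=64))
print(effective_rank(healthy), effective_rank(collapsed))
```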
arXiv Detail & Related papers (2026-01-06T19:47:11Z) - Gradient Boosted Mixed Models: Flexible Joint Estimation of Mean and Variance Components for Clustered Data [0.6361348748202732]
Gradient Boosted Mixed Models (GBMixed) is a framework and algorithm that extends boosting to jointly estimate mean and variance components. Simulations and real-world applications demonstrate accurate recovery of variance components, calibrated prediction intervals, and improved predictive accuracy.
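The GBMixed algorithm itself is not specified in this summary. A naive two-stage boosting heuristic, a mean model followed by a model for log squared residuals, is sketched below purely to make the mean/variance idea concrete; GBMixed estimates both components jointly.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative two-stage heuristic only: boost a mean model, then boost a
# model of log squared residuals as a crude variance estimate. This is
# NOT the GBMixed algorithm, which jointly estimates both components.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 3))
sigma = np.exp(0.5 * X[:, 0])                     # heteroscedastic noise
y = np.sin(X[:, 1]) + sigma * rng.normal(size=500)

mean_model = GradientBoostingRegressor().fit(X, y)
resid2 = (y - mean_model.predict(X)) ** 2
var_model = GradientBoostingRegressor().fit(X, np.log(resid2 + 1e-8))

mu = mean_model.predict(X)
sd = np.sqrt(np.exp(var_model.predict(X)))
interval = np.stack([mu - 1.96 * sd, mu + 1.96 * sd], axis=1)  # ~95% PI
```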
arXiv Detail & Related papers (2025-10-31T19:28:26Z) - Robust Molecular Property Prediction via Densifying Scarce Labeled Data [53.24886143129006]
In drug discovery, compounds most critical for advancing research often lie beyond the training set. We propose a novel bilevel optimization approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data.
arXiv Detail & Related papers (2025-06-13T15:27:40Z) - Generalization error of min-norm interpolators in transfer learning [2.7309692684728617]
Min-norm interpolators emerge naturally as implicit regularized limits of modern machine learning algorithms.
In many applications, a limited amount of test data may be available during training, yet properties of min-norm interpolators in this setting are not well understood.
We establish a novel anisotropic local law to achieve these characterizations.
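As a concrete reference point (not the paper's transfer-learning analysis), the min-norm interpolator in an overparameterized linear model is simply the pseudoinverse solution, sketched below.

```python
import numpy as np

# Min-norm interpolator in an overparameterized linear model:
# argmin ||theta||_2 subject to X @ theta = y, via the Moore-Penrose
# pseudoinverse. A toy illustration of the object being studied.
rng = np.random.default_rng(2)
n, d = 50, 200                      # more parameters than samples
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d) / np.sqrt(d)
y = X @ theta_star

theta_hat = np.linalg.pinv(X) @ y   # min-norm solution
assert np.allclose(X @ theta_hat, y)          # interpolates training data
print(np.linalg.norm(theta_hat), np.linalg.norm(theta_star))
```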
arXiv Detail & Related papers (2024-06-20T02:23:28Z) - Quantifying predictive uncertainty of aphasia severity in stroke patients with sparse heteroscedastic Bayesian high-dimensional regression [47.1405366895538]
Sparse linear regression methods for high-dimensional data commonly assume that residuals have constant variance, which can be violated in practice.
This paper proposes estimating high-dimensional heteroscedastic linear regression models using a heteroscedastic partitioned empirical Bayes Expectation Conditional Maximization algorithm.
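The paper's partitioned empirical Bayes ECM algorithm is not detailed in this summary; a generic baseline for the heteroscedastic setup it addresses is iteratively reweighted least squares, sketched below with illustrative regularization constants.

```python
import numpy as np

# Generic iteratively reweighted least squares for heteroscedastic linear
# regression: alternate a weighted fit of the mean with a re-estimate of
# per-observation variances. A crude baseline, not the paper's sparse
# empirical Bayes ECM algorithm.
def irls_heteroscedastic(X, y, n_iter=20):
    w = np.ones(len(y))                          # start with equal weights
    for _ in range(n_iter):
        Xw = X * w[:, None]
        beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)  # weighted LS step
        resid2 = (y - X @ beta) ** 2
        w = 1.0 / (resid2 + 1e-6)                # inverse-variance weights
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
beta_true = np.arange(1, 6, dtype=float)
y = X @ beta_true + np.exp(X[:, 0]) * rng.normal(size=300)
print(irls_heteroscedastic(X, y))
```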
arXiv Detail & Related papers (2023-09-15T22:06:29Z) - Variational Classification [51.2541371924591]
We derive a variational objective to train the model, analogous to the evidence lower bound (ELBO) used to train variational auto-encoders.
Treating inputs to the softmax layer as samples of a latent variable, our abstracted perspective reveals a potential inconsistency.
We induce a chosen latent distribution, instead of the implicit assumption found in a standard softmax layer.
arXiv Detail & Related papers (2023-05-17T17:47:19Z) - An Epistemic and Aleatoric Decomposition of Arbitrariness to Constrain the Set of Good Models [7.620967781722717]
Recent research reveals that machine learning (ML) models are highly sensitive to minor changes in their training procedure. We show that stability decomposes into epistemic and aleatoric components, capturing the consistency and confidence in prediction. We propose a model selection procedure that includes epistemic and aleatoric criteria alongside existing accuracy and fairness criteria, and show that it successfully narrows down a large set of good models.
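The paper's exact decomposition is not reproduced here; one common way to separate the two components over an ensemble of retrained models, used purely for illustration, is the mutual-information-style split below.

```python
import numpy as np

# Illustrative epistemic/aleatoric split over an ensemble of models
# retrained with different seeds (probs: models x examples x classes).
# A standard mutual-information decomposition, not the paper's criteria.
def entropy(p, axis=-1):
    return -(p * np.log(np.clip(p, 1e-12, 1.0))).sum(axis=axis)

def decompose(probs):
    mean_p = probs.mean(axis=0)               # average predictive dist.
    total = entropy(mean_p)                   # total uncertainty
    aleatoric = entropy(probs).mean(axis=0)   # expected per-model entropy
    epistemic = total - aleatoric             # disagreement across models
    return epistemic, aleatoric

rng = np.random.default_rng(4)
logits = rng.normal(size=(10, 5, 3))          # 10 models, 5 inputs, 3 classes
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(decompose(probs))
```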
arXiv Detail & Related papers (2023-02-09T09:35:36Z) - TACTiS: Transformer-Attentional Copulas for Time Series [76.71406465526454]
Estimation of time-varying quantities is a fundamental component of decision making in fields such as healthcare and finance.
We propose a versatile method that estimates joint distributions using an attention-based decoder.
We show that our model produces state-of-the-art predictions on several real-world datasets.
arXiv Detail & Related papers (2022-02-07T21:37:29Z) - Optimal regularizations for data generation with probabilistic graphical models [0.0]
Empirically, well-chosen regularization schemes dramatically improve the quality of the inferred models.
We consider the particular case of $L_2$ and $L_1$ regularizations in the Maximum A Posteriori (MAP) inference of generative pairwise graphical models.
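Concretely, MAP inference with these penalties solves the following program, where the symbols ($J$ for the pairwise couplings, $\eta$ for the penalty weight) are assumed here for illustration rather than taken from the paper:

```latex
% MAP estimation of pairwise couplings J with an L1 or L2 penalty;
% the notation (J, eta) is illustrative, not the paper's.
\[
  \hat{J}_{\mathrm{MAP}}
  = \arg\max_{J} \; \Bigl[ \log P(\mathrm{data} \mid J)
  - \eta \, \|J\|_p^p \Bigr],
  \qquad p \in \{1, 2\},
\]
% where p = 1 promotes sparse couplings and p = 2 shrinks them smoothly.
```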
arXiv Detail & Related papers (2021-12-02T14:45:16Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z) - Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
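The consistent estimator itself is not given in this summary; the underlying ensemble, majority-voting LDA classifiers fit on independent random projections, can be sketched as below (function names and constants are illustrative).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Ensemble of LDA classifiers on independent Gaussian random projections,
# combined by majority vote. A plain illustration of the construction;
# the paper's contribution is an estimator of its misclassification rate.
def rp_lda_ensemble(X, y, X_test, n_members=25, proj_dim=10, seed=0):
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_members):
        R = rng.normal(size=(X.shape[1], proj_dim)) / np.sqrt(proj_dim)
        clf = LinearDiscriminantAnalysis().fit(X @ R, y)
        votes.append(clf.predict(X_test @ R))
    votes = np.stack(votes)                    # members x test points
    return np.array([np.bincount(v).argmax() for v in votes.T])

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (100, 50)), rng.normal(1, 1, (100, 50))])
y = np.repeat([0, 1], 100)
print(rp_lda_ensemble(X, y, X[:5]))
```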
arXiv Detail & Related papers (2020-04-17T12:47:04Z) - Is completeness necessary? Estimation in nonidentified linear models [0.0]
We develop a theory of regularized estimators, which include methods such as high-dimensional ridge regularization, gradient descent, and principal component analysis (PCA). The results are illustrated for high-dimensional and nonparametric instrumental variable regressions and are supported through simulation experiments.
arXiv Detail & Related papers (2017-09-11T16:49:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.