Incomplete U-Statistics of Equireplicate Designs: Berry-Esseen Bound and Efficient Construction
- URL: http://arxiv.org/abs/2510.20755v2
- Date: Fri, 24 Oct 2025 15:03:09 GMT
- Title: Incomplete U-Statistics of Equireplicate Designs: Berry-Esseen Bound and Efficient Construction
- Authors: Cesare Miglioli, Jordan Awan
- Abstract summary: This paper presents a novel perspective on U-statistics, grounded in hypergraph theory and combinatorial designs. By characterizing the dependence structure of a U-statistic, we derive a Berry-Esseen bound valid for incomplete U-statistics. We also introduce efficient algorithms to construct incomplete U-statistics of equireplicate designs.
- Score: 4.123234066624863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: U-statistics are a fundamental class of estimators that generalize the sample mean and underpin much of nonparametric statistics. Although extensively studied in both statistics and probability, key challenges remain: their high computational cost - addressed partly through incomplete U-statistics - and their non-standard asymptotic behavior in the degenerate case, which typically requires resampling methods for hypothesis testing. This paper presents a novel perspective on U-statistics, grounded in hypergraph theory and combinatorial designs. Our approach bypasses the traditional Hoeffding decomposition, the main analytical tool in this literature but one highly sensitive to degeneracy. By characterizing the dependence structure of a U-statistic, we derive a Berry-Esseen bound valid for incomplete U-statistics of deterministic designs, yielding conditions under which Gaussian limiting distributions can be established even in degenerate cases and when the order diverges. We also introduce efficient algorithms to construct incomplete U-statistics of equireplicate designs, a subclass of deterministic designs that, in certain cases, achieve minimum variance. Finally, we apply our framework to kernel-based tests that use Maximum Mean Discrepancy (MMD) and Hilbert-Schmidt Independence Criterion. In a real data example with the CIFAR-10 dataset, our permutation-free MMD test delivers substantial computational gains while retaining power and type I error control.
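As a rough illustration of the idea in the abstract, the sketch below contrasts a complete degree-two U-statistic with an incomplete one evaluated over a fixed deterministic design. The `cyclic_equireplicate_design` helper is a simple illustrative construction in which every index appears in the same number of pairs; it is not the paper's construction algorithm, and the variance kernel is just a convenient example.

```python
import itertools
import statistics

def complete_u_statistic(data, kernel):
    """Average a symmetric kernel over all pairs: a degree-2 U-statistic."""
    pairs = list(itertools.combinations(range(len(data)), 2))
    return sum(kernel(data[i], data[j]) for i, j in pairs) / len(pairs)

def incomplete_u_statistic(data, kernel, design):
    """Average the kernel over a fixed subset ("design") of pairs only."""
    return sum(kernel(data[i], data[j]) for i, j in design) / len(design)

def cyclic_equireplicate_design(n, m):
    """Illustrative deterministic design: pair i with i+s (mod n) for
    offsets s = 1..m, so every index appears in exactly 2*m pairs
    (equireplicate). NOT the paper's construction algorithm."""
    return [(i, (i + s) % n) for s in range(1, m + 1) for i in range(n)]

data = [0.5, 1.2, -0.3, 2.1, 0.0, 1.7, -1.1, 0.9]
# With h(x, y) = (x - y)^2 / 2, the complete U-statistic equals the
# sample variance; the incomplete version estimates the same quantity
# from far fewer kernel evaluations.
kernel = lambda x, y: (x - y) ** 2 / 2

design = cyclic_equireplicate_design(len(data), 2)
print(complete_u_statistic(data, kernel))            # sample variance
print(incomplete_u_statistic(data, kernel, design))  # cheap estimate of it
```

The complete statistic uses all 28 pairs here, while the design uses only 16; for large n the gap grows from O(n^2) kernel evaluations to O(mn), which is the computational motivation for incomplete U-statistics.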
Related papers
- Likelihood-Preserving Embeddings for Statistical Inference [0.0]
Modern machine learning embeddings provide powerful compression of high-dimensional data. This paper develops a theory of likelihood-preserving embeddings. Experiments on Gaussian and Cauchy distributions validate the sharp phase transition predicted by exponential family theory.
arXiv Detail & Related papers (2025-12-27T16:21:55Z)
- Statistical Inference under Adaptive Sampling with LinUCB [15.167069362020426]
We show that the linear upper confidence bound (LinUCB) algorithm for linear bandits satisfies a property called stability. We establish a central limit theorem for the LinUCB algorithm, showing that the limiting distribution of the estimation error is normal.
arXiv Detail & Related papers (2025-11-28T21:48:18Z)
- A Unified Framework for Inference with General Missingness Patterns and Machine Learning Imputation [12.817707155207817]
This paper develops a novel method that delivers a valid statistical inference framework for general Z-estimation problems. We provide theoretical guarantees of normality of the proposed estimator and of efficiency dominance over weighted complete-case analyses.
arXiv Detail & Related papers (2025-08-21T01:59:59Z)
- A Sample Efficient Conditional Independence Test in the Presence of Discretization [54.047334792855345]
Applying Conditional Independence (CI) tests directly to discretized data can lead to incorrect conclusions. Recent advances have sought to infer the correct CI relationship between the latent variables by binarizing the observed data. Motivated by this, the paper introduces a sample-efficient CI test that does not rely on the binarization process.
arXiv Detail & Related papers (2025-06-10T12:41:26Z)
- CLT and Edgeworth Expansion for m-out-of-n Bootstrap Estimators of The Studentized Median [4.174296652683762]
The m-out-of-n bootstrap approximates the distribution of a statistic by repeatedly drawing subsamples of size m without replacement from an original sample of size n. Despite its broad applicability across econometrics, biostatistics, and machine learning, rigorous parameter-free guarantees for the soundness of the bootstrap have remained elusive. This paper establishes such guarantees by analyzing the estimator of sample quantiles obtained from m-out-of-n resampling of a dataset of size n.
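The resampling scheme in this summary is simple to state in code: a minimal sketch, using the Python standard library, of m-out-of-n bootstrapping for the median (the studentization step and the Edgeworth analysis from the paper are omitted; the sample sizes and seed are arbitrary choices).

```python
import random
import statistics

def m_out_of_n_bootstrap(sample, m, reps, estimator, seed=0):
    """Approximate the sampling distribution of `estimator` by repeatedly
    drawing subsamples of size m WITHOUT replacement from the n points."""
    rng = random.Random(seed)
    return [estimator(rng.sample(sample, m)) for _ in range(reps)]

rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]  # original sample, n = 200

# Bootstrap distribution of the median from m = 50 out of n = 200.
boot = m_out_of_n_bootstrap(sample, m=50, reps=1000, estimator=statistics.median)
print(statistics.mean(boot), statistics.stdev(boot))
```

Sampling without replacement (via `random.Random.sample`) is what distinguishes this from the classical n-out-of-n bootstrap and is the regime the paper's guarantees address.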
arXiv Detail & Related papers (2025-05-16T22:14:49Z)
- High-dimensional logistic regression with missing data: Imputation, regularization, and universality [7.167672851569787]
We study high-dimensional, ridge-regularized logistic regression.
We provide exact characterizations of both the prediction error and the estimation error.
arXiv Detail & Related papers (2024-10-01T21:41:21Z)
- Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods [59.779795063072655]
Chain-of-Thought (CoT) prompting and its variants have gained popularity as effective methods for solving multi-step reasoning problems.
We analyze CoT prompting from a statistical estimation perspective, providing a comprehensive characterization of its sample complexity.
arXiv Detail & Related papers (2024-08-25T04:07:18Z)
- Selective Nonparametric Regression via Testing [54.20569354303575]
We develop an abstention procedure by testing a hypothesis on the value of the conditional variance at a given point.
Unlike existing methods, the proposed procedure accounts not only for the value of the variance itself but also for the uncertainty of the corresponding variance predictor.
arXiv Detail & Related papers (2023-09-28T13:04:11Z)
- A High-dimensional Convergence Theorem for U-statistics with Applications to Kernel-based Testing [3.469038201881982]
We prove a convergence theorem for U-statistics of degree two, where the data dimension $d$ is allowed to scale with sample size $n$.
We apply our theory to two popular kernel-based distribution tests, MMD and KSD, whose high-dimensional performance has been challenging to study.
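MMD, mentioned both here and in the main abstract, is itself a U-statistic of degree two, which is why these convergence results apply to it. A minimal sketch of the unbiased (U-statistic) MMD^2 estimator follows; the Gaussian kernel with bandwidth 1 and the sample sizes are illustrative choices, and KSD is omitted.

```python
import math
import random

def gaussian_kernel(x, y, bandwidth=1.0):
    return math.exp(-(x - y) ** 2 / (2 * bandwidth ** 2))

def mmd_u_statistic(xs, ys, kernel):
    """Unbiased MMD^2 estimate as a two-sample U-statistic of degree two.
    Diagonal terms i == j are excluded, which makes the estimator unbiased
    and degenerate under the null (equal distributions)."""
    n, m = len(xs), len(ys)
    kxx = sum(kernel(xs[i], xs[j])
              for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    kyy = sum(kernel(ys[i], ys[j])
              for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    kxy = sum(kernel(x, y) for x in xs for y in ys) / (n * m)
    return kxx + kyy - 2 * kxy

rng = random.Random(0)
# Same distribution: the estimate should be near zero.
same = mmd_u_statistic([rng.gauss(0, 1) for _ in range(100)],
                       [rng.gauss(0, 1) for _ in range(100)], gaussian_kernel)
# Shifted distribution: the estimate should be clearly positive.
shifted = mmd_u_statistic([rng.gauss(0, 1) for _ in range(100)],
                          [rng.gauss(2, 1) for _ in range(100)], gaussian_kernel)
print(same, shifted)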
arXiv Detail & Related papers (2023-02-11T12:49:46Z)
- Nonparametric Conditional Local Independence Testing [69.31200003384122]
Conditional local independence is an independence relation among continuous time processes.
No nonparametric test of conditional local independence has been available.
We propose such a nonparametric test based on double machine learning.
arXiv Detail & Related papers (2022-03-25T10:31:02Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
- Instability, Computational Efficiency and Statistical Accuracy [101.32305022521024]
We develop a framework that yields statistical accuracy based on the interplay between the deterministic convergence rate of the algorithm at the population level and its degree of instability when applied to an empirical object based on $n$ samples.
We provide applications of our general results to several concrete classes of models, including Gaussian mixture estimation, non-linear regression models, and informative non-response models.
arXiv Detail & Related papers (2020-05-22T22:30:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.