On the Subbagging Estimation for Massive Data
- URL: http://arxiv.org/abs/2103.00631v1
- Date: Sun, 28 Feb 2021 21:38:22 GMT
- Title: On the Subbagging Estimation for Massive Data
- Authors: Tao Zou, Xian Li, Xuan Liang, Hansheng Wang
- Abstract summary: This article introduces subbagging (subsample aggregating) estimation approaches for big data analysis with memory constraints of computers.
For the whole dataset of size $N$, $m_N$ subsamples are randomly drawn, and each subsample, of size $k_N \ll N$ to meet the memory constraint, is sampled uniformly without replacement.
An American airline dataset is analyzed to illustrate that the subbagging estimate is numerically close to the full sample estimate, and can be computationally fast under the memory constraint.
- Score: 10.902757578215255
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This article introduces subbagging (subsample aggregating) estimation
approaches for big data analysis with memory constraints of computers.
Specifically, for the whole dataset of size $N$, $m_N$ subsamples are randomly
drawn, and each subsample, of size $k_N\ll N$ to meet the memory constraint, is
sampled uniformly without replacement. Aggregating the estimators from the
$m_N$ subsamples yields the subbagging estimator. To analyze
the theoretical properties of the subbagging estimator, we adapt the incomplete
$U$-statistics theory with an infinite order kernel to allow overlapping drawn
subsamples in the sampling procedure. Utilizing this novel theoretical
framework, we demonstrate that via a proper hyperparameter selection of $k_N$
and $m_N$, the subbagging estimator can achieve $\sqrt{N}$-consistency and
asymptotic normality under the condition $(k_Nm_N)/N\to \alpha \in (0,\infty]$.
Compared to the full sample estimator, we theoretically show that the
$\sqrt{N}$-consistent subbagging estimator has an inflation rate of $1/\alpha$
in its asymptotic variance. Simulation experiments are presented to demonstrate
the finite sample performances. An American airline dataset is analyzed to
illustrate that the subbagging estimate is numerically close to the full sample
estimate, and can be computationally fast under the memory constraint.
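To make the procedure in the abstract concrete, the sketch below draws $m_N$ subsamples of size $k_N$ uniformly without replacement, computes an estimator on each, and averages them. It is a minimal illustration under stated assumptions: the least-squares estimator, the synthetic data, and the particular $k_N$ and $m_N$ values are not the authors' setup.

```python
# Minimal subbagging sketch: fit an estimator on m_N subsamples of size k_N
# (each drawn uniformly without replacement) and average the results.
# The OLS estimator and the k_N, m_N values below are illustrative assumptions.
import numpy as np

def subbagging_estimate(X, y, k_N, m_N, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    estimates = []
    for _ in range(m_N):
        idx = rng.choice(N, size=k_N, replace=False)             # one subsample, no replacement
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)   # subsample estimator
        estimates.append(beta)
    return np.mean(estimates, axis=0)                            # aggregated (subbagging) estimator

# Toy usage: per the abstract, when (k_N * m_N) / N -> alpha in (0, inf], the
# aggregated estimator is sqrt(N)-consistent with an inflation rate of 1/alpha
# in its asymptotic variance relative to the full sample estimator.
rng = np.random.default_rng(1)
N, p = 100_000, 3
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)
print(subbagging_estimate(X, y, k_N=1_000, m_N=200))
```

With $k_N = 1{,}000$ and $m_N = 200$ here, $(k_N m_N)/N = 2$, so the toy run sits inside the regime $(k_N m_N)/N \to \alpha \in (0,\infty]$ discussed in the abstract.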
Related papers
- Statistical-Computational Trade-offs for Density Estimation [60.81548752871115]
We show that for a broad class of data structures their bounds cannot be significantly improved.
This is a novel statistical-computational trade-off for density estimation.
arXiv Detail & Related papers (2024-10-30T15:03:33Z)
- Precise Asymptotics of Bagging Regularized M-estimators [5.165142221427928]
We characterize the squared prediction risk of ensemble estimators obtained through subagging (subsample bootstrap aggregating) regularized M-estimators.
Key to our analysis is a new result on the joint behavior of correlations between the estimator and residual errors on overlapping subsamples.
Joint optimization of subsample size, ensemble size, and regularization can significantly outperform regularizer optimization alone on the full data.
arXiv Detail & Related papers (2024-09-23T17:48:28Z)
- Variance Reduction for the Independent Metropolis Sampler [11.074080383657453]
We prove that if $\pi$ is close enough under KL divergence to another density $q$, an independence sampler that targets $\pi$ with proposals drawn from $q$ achieves smaller asymptotic variance than i.i.d. sampling from $\pi$.
We propose an adaptive independent Metropolis algorithm that adapts the proposal density so that its KL divergence from the target is progressively reduced.
arXiv Detail & Related papers (2024-06-25T16:38:53Z)
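For context on the sampler referenced in the entry above, here is a minimal sketch of an independent Metropolis loop; `log_pi`, `log_q`, and `sample_q` are assumed user-supplied, and the paper's adaptive update of the proposal density is not reproduced.

```python
# Minimal independent Metropolis sampler: proposals are drawn from q independently of
# the current state and accepted with the Metropolis-Hastings probability.
# log_pi, log_q, sample_q are assumed user-supplied placeholders.
import numpy as np

def independent_metropolis(log_pi, log_q, sample_q, x0, n_steps, seed=0):
    rng = np.random.default_rng(seed)
    x, chain = x0, []
    for _ in range(n_steps):
        y = sample_q(rng)                                            # proposal, independent of x
        log_alpha = (log_pi(y) - log_pi(x)) + (log_q(x) - log_q(y))  # log acceptance ratio
        if np.log(rng.uniform()) < log_alpha:                        # accept/reject step
            x = y
        chain.append(x)
    return np.array(chain)
```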
- Convergence Analysis of Probability Flow ODE for Score-based Generative Models [5.939858158928473]
We study the convergence properties of deterministic samplers based on probability flow ODEs from both theoretical and numerical perspectives.
We prove that the total variation distance between the target and the generated data distributions can be bounded above by $\mathcal{O}(d^{3/4}\delta^{1/2})$ at the continuous-time level.
arXiv Detail & Related papers (2024-04-15T12:29:28Z)
- Sparse PCA with Oracle Property [115.72363972222622]
We propose a family of estimators based on the semidefinite relaxation of sparse PCA with novel regularizations.
We prove that another estimator within the family achieves a sharper statistical rate of convergence than the standard semidefinite relaxation of sparse PCA.
arXiv Detail & Related papers (2023-12-28T02:52:54Z)
- Effective Minkowski Dimension of Deep Nonparametric Regression: Function Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures.
This paper introduces a relaxed assumption that the input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and that the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notion, the effective Minkowski dimension.
arXiv Detail & Related papers (2023-06-26T17:13:31Z)
- Adaptive Sketches for Robust Regression with Importance Sampling [64.75899469557272]
We introduce data structures for solving robust regression through stochastic gradient descent (SGD).
Our algorithm effectively runs $T$ steps of SGD with importance sampling while using sublinear space and just making a single pass over the data.
arXiv Detail & Related papers (2022-07-16T03:09:30Z)
- Heavy-tailed Streaming Statistical Estimation [58.70341336199497]
We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples.
We design a clipped stochastic gradient descent algorithm and provide an improved analysis under a more nuanced condition on the noise of the stochastic gradients.
arXiv Detail & Related papers (2021-08-25T21:30:27Z)
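As a rough illustration of the clipped-gradient idea summarized above, the sketch below runs clipped stochastic gradient descent for streaming mean estimation; the squared-error objective, clipping level, and step-size schedule are illustrative assumptions rather than the paper's algorithm.

```python
# Illustrative clipped SGD for streaming mean estimation under heavy-tailed noise.
# The squared-error objective, clipping level, and step sizes are assumptions,
# not the algorithm analyzed in the paper.
import numpy as np

def clipped_sgd_mean(stream, clip=1.0, step0=1.0):
    theta = 0.0
    for t, x in enumerate(stream, start=1):
        grad = theta - x                              # gradient of 0.5 * (theta - x)^2
        grad = float(np.clip(grad, -clip, clip))      # clip the stochastic gradient
        theta -= (step0 / np.sqrt(t)) * grad          # decaying step size
    return theta

# Toy usage with heavy-tailed (Student-t) noise around a true mean of 2.0
rng = np.random.default_rng(0)
print(clipped_sgd_mean(2.0 + rng.standard_t(df=2.5, size=10_000)))
```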
- Maximum sampled conditional likelihood for informative subsampling [4.708378681950648]
Subsampling is a computationally effective approach to extract information from massive data sets when computing resources are limited.
We propose to use the maximum sampled conditional likelihood estimator (MSCLE) based on the sampled data.
arXiv Detail & Related papers (2020-11-11T16:01:17Z)
- Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2}+\frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z)
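For the entry above, a minimal tabular asynchronous Q-learning loop is sketched below; the MDP interface, uniform behavior policy, and polynomial step-size schedule are illustrative assumptions, not the setting analyzed in the paper.

```python
# Minimal asynchronous (single-trajectory) tabular Q-learning sketch.
# P is an (S, A, S) transition tensor and R an (S, A) reward matrix; the behavior
# policy and step-size schedule here are illustrative assumptions.
import numpy as np

def async_q_learning(P, R, gamma, n_steps, seed=0):
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    s = 0
    for t in range(1, n_steps + 1):
        a = rng.integers(A)                              # uniformly random behavior policy
        s_next = rng.choice(S, p=P[s, a])                # sample the next state
        target = R[s, a] + gamma * Q[s_next].max()       # TD target for one (s, a) entry
        Q[s, a] += (t ** -0.6) * (target - Q[s, a])      # asynchronous entrywise update
        s = s_next
    return Q
```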
- Error bounds in estimating the out-of-sample prediction error using leave-one-out cross validation in high-dimensions [19.439945058410203]
We study the problem of out-of-sample risk estimation in the high dimensional regime.
Extensive empirical evidence confirms the accuracy of leave-one-out cross validation.
One technical advantage of the theory is that it can be used to clarify and connect some results from the recent literature on scalable approximate LO.
arXiv Detail & Related papers (2020-03-03T20:07:07Z)