Maximum sampled conditional likelihood for informative subsampling
- URL: http://arxiv.org/abs/2011.05988v3
- Date: Wed, 11 May 2022 20:24:20 GMT
- Title: Maximum sampled conditional likelihood for informative subsampling
- Authors: HaiYing Wang and Jae Kwang Kim
- Abstract summary: Subsampling is a computationally effective approach to extract information from massive data sets when computing resources are limited.
We propose to use the maximum sampled conditional likelihood estimator (MSCLE) based on the sampled data.
- Score: 4.708378681950648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Subsampling is a computationally effective approach to extract information
from massive data sets when computing resources are limited. After a subsample
is taken from the full data, most available methods use an inverse probability
weighted (IPW) objective function to estimate the model parameters. The IPW
estimator does not fully utilize the information in the selected subsample. In
this paper, we propose to use the maximum sampled conditional likelihood
estimator (MSCLE) based on the sampled data. We establish the asymptotic
normality of the MSCLE and prove that its asymptotic variance-covariance matrix
is the smallest among a class of asymptotically unbiased estimators, including
the IPW estimator. We further discuss the asymptotic results with the L-optimal
subsampling probabilities and illustrate the estimation procedure with
generalized linear models. Numerical experiments are provided to evaluate the
practical performance of the proposed method.
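
To make the contrast concrete, below is a minimal, self-contained sketch (not the authors' code) of IPW versus the sampled conditional likelihood in the binary logistic case with Poisson subsampling whose inclusion probabilities depend on y. The inclusion probabilities here are illustrative, not the L-optimal ones discussed in the paper; the key point is that the MSCLE maximizes P(y | x, selected), which is proportional to pi(x, y) * P(y | x; theta), while IPW reweights the subsample log-likelihood by 1/pi.

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(0)

# Full data: logistic regression with a known truth.
n, theta_true = 100_000, np.array([0.5, -1.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.binomial(1, expit(X @ theta_true))

def incl_prob(y_val):
    # Illustrative y-dependent (informative) inclusion probabilities:
    # keep half of the y = 1 units, heavily thin the y = 0 units.
    return np.where(y_val == 1, 0.5, 0.02)

pi = incl_prob(y)
S = rng.random(n) < pi                         # Poisson subsampling
Xs, ys, pis = X[S], y[S], pi[S]

def neg_ipw_loglik(theta):
    # IPW: reweight each sampled unit's log-likelihood by 1 / pi_i.
    p = np.clip(expit(Xs @ theta), 1e-12, 1 - 1e-12)
    return -np.sum((ys * np.log(p) + (1 - ys) * np.log(1 - p)) / pis)

def neg_scl_loglik(theta):
    # Sampled conditional likelihood: P(y | x, selected) is proportional
    # to pi(y) * P(y | x; theta), normalized over both labels.
    p1 = np.clip(expit(Xs @ theta), 1e-12, 1 - 1e-12)
    w1 = incl_prob(np.ones_like(ys)) * p1      # label 1: pi(1) * P(1 | x)
    w0 = incl_prob(np.zeros_like(ys)) * (1 - p1)
    return -np.sum(np.log(np.where(ys == 1, w1, w0)) - np.log(w0 + w1))

print("IPW:  ", minimize(neg_ipw_loglik, np.zeros(2), method="BFGS").x)
print("MSCLE:", minimize(neg_scl_loglik, np.zeros(2), method="BFGS").x)

Both fits are consistent for theta_true in this setup; the paper's result is that the conditional-likelihood fit attains the smaller asymptotic variance within the class of asymptotically unbiased estimators that includes IPW.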
Related papers
- Bayesian Estimation and Tuning-Free Rank Detection for Probability Mass Function Tensors [17.640500920466984]
This paper presents a novel framework for estimating the joint PMF and automatically inferring its rank from observed data.
We derive a deterministic solution based on variational inference (VI) to approximate the posterior distributions of various model parameters. Additionally, we develop a scalable version of the VI-based approach by leveraging stochastic variational inference (SVI).
Experiments involving both synthetic data and real movie recommendation data illustrate the advantages of our VI and SVI-based methods in terms of estimation accuracy, automatic rank detection, and computational efficiency.
arXiv Detail & Related papers (2024-10-08T20:07:49Z)
- Adaptive Online Bayesian Estimation of Frequency Distributions with Local Differential Privacy [0.4604003661048266]
We propose a novel approach for the adaptive and online estimation of the frequency distribution of a finite number of categories under the local differential privacy (LDP) framework.
The proposed algorithm performs Bayesian parameter estimation via posterior sampling and adapts the randomization mechanism for LDP based on the obtained posterior samples.
We provide a theoretical analysis showing that (i) the posterior distribution targeted by the algorithm converges to the true parameter even for approximate posterior sampling, and (ii) the algorithm selects the optimal subset with high probability if posterior sampling is performed exactly.
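
The paper's contribution is the adaptive Bayesian layer on top of an LDP mechanism. As background, here is a hedged sketch of k-ary randomized response, a standard (non-adaptive) LDP mechanism for frequency estimation; it is illustrative and not necessarily the exact mechanism the authors adapt.

import numpy as np

rng = np.random.default_rng(1)

def krr_report(values, k, eps):
    # k-ary randomized response: report the true category with
    # probability p = e^eps / (e^eps + k - 1), otherwise report one of
    # the other k - 1 categories uniformly at random.
    p = np.exp(eps) / (np.exp(eps) + k - 1)
    keep = rng.random(len(values)) < p
    other = (values + rng.integers(1, k, size=len(values))) % k  # never the truth
    return np.where(keep, values, other)

def krr_estimate(reports, k, eps):
    # Invert E[c_j / n] = q + f_j * (p - q) for an unbiased frequency estimate.
    p = np.exp(eps) / (np.exp(eps) + k - 1)
    q = 1.0 / (np.exp(eps) + k - 1)
    c = np.bincount(reports, minlength=k) / len(reports)
    return (c - q) / (p - q)

k, eps = 5, 1.0
true_f = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
values = rng.choice(k, size=200_000, p=true_f)
print(krr_estimate(krr_report(values, k, eps), k, eps))  # close to true_f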
arXiv Detail & Related papers (2024-05-11T13:59:52Z)
- Self-Supervised Dataset Distillation for Transfer Learning [77.4714995131992]
We propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL).
We first prove that the gradient of synthetic samples with respect to an SSL objective in naive bilevel optimization is biased due to randomness originating from data augmentations or masking.
We empirically validate the effectiveness of our method on various applications involving transfer learning.
arXiv Detail & Related papers (2023-10-10T10:48:52Z)
- Rethinking Collaborative Metric Learning: Toward an Efficient Alternative without Negative Sampling [156.7248383178991]
The Collaborative Metric Learning (CML) paradigm has attracted wide interest in the area of recommender systems (RS).
We find that negative sampling would lead to a biased estimation of the generalization error.
Motivated by this, we propose an efficient alternative without negative sampling for CML, named Sampling-Free Collaborative Metric Learning (SFCML).
arXiv Detail & Related papers (2022-06-23T08:50:22Z)
- Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data [15.696653979226113]
We investigate the issue of parameter estimation with nonuniform negative sampling for imbalanced data.
We derive a general inverse probability weighted (IPW) estimator and obtain the optimal sampling probability that minimizes its variance.
Both theoretical and empirical results demonstrate the effectiveness of our method.
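
Variance-minimizing probabilities in this line of work typically share one shape. As a hedged sketch in the standard A-/L-optimality form from the optimal subsampling literature (not necessarily this paper's exact formula):

\hat\theta_{\mathrm{IPW}} \ \text{solves} \ \sum_{i \in S} \frac{\psi(x_i, y_i; \theta)}{\pi_i} = 0,
\qquad
\pi_i \;\propto\; \bigl\lVert L\,\psi\bigl(x_i, y_i; \hat\theta_{\mathrm{pilot}}\bigr) \bigr\rVert,

where \psi is unit i's score contribution, L encodes the optimality criterion (L = I gives A-optimality), \hat\theta_{\mathrm{pilot}} is a pilot estimate standing in for the unknown parameter, and the \pi_i are normalized to the expected subsample size.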
arXiv Detail & Related papers (2021-10-25T15:37:22Z)
- Sampling from Arbitrary Functions via PSD Models [55.41644538483948]
We take a two-step approach by first modeling the probability distribution and then sampling from that model.
We show that these models can approximate a large class of densities concisely using few evaluations, and present a simple algorithm to effectively sample from these models.
arXiv Detail & Related papers (2021-10-20T12:25:22Z)
- Near-optimal inference in adaptive linear regression [60.08422051718195]
Even simple methods like least squares can exhibit non-normal behavior when data is collected in an adaptive manner.
We propose a family of online debiasing estimators to correct these distributional anomalies in least squares estimation.
We demonstrate the usefulness of our theory via applications to multi-armed bandit, autoregressive time series estimation, and active learning with exploration.
arXiv Detail & Related papers (2021-07-05T21:05:11Z)
- Optimal Distributed Subsampling for Maximum Quasi-Likelihood Estimators with Massive Data [20.79270369203348]
Existing methods mostly focus on subsampling with replacement due to its high computational efficiency.
We first derive optimal subsampling probabilities in the context of quasi-likelihood estimation.
We develop a distributed subsampling framework, in which statistics are computed simultaneously on smaller partitions of the full data.
arXiv Detail & Related papers (2020-05-21T02:46:56Z)
- Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
- SUMO: Unbiased Estimation of Log Marginal Probability for Latent Variable Models [80.22609163316459]
We introduce an unbiased estimator of the log marginal likelihood and its gradients for latent variable models based on randomized truncation of infinite series.
We show that models trained using our estimator give better test-set likelihoods than a standard importance-sampling based approach for the same average computational cost.
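
SUMO's estimator targets the log marginal likelihood specifically; the underlying trick, an unbiased estimate of an infinite series from a randomly truncated partial sum (a Russian roulette estimator), can be sketched generically (illustrative code, not the authors' implementation):

import numpy as np

rng = np.random.default_rng(2)

def russian_roulette(delta, q=0.6, n_draws=20_000):
    # Unbiased estimate of sum_{k >= 0} delta(k): draw a random truncation
    # level K with P(K >= k) = q**k, then reweight term k by 1 / q**k.
    ests = np.empty(n_draws)
    for i in range(n_draws):
        K = rng.geometric(1 - q) - 1           # support {0, 1, 2, ...}
        ests[i] = sum(delta(k) / q**k for k in range(K + 1))
    return ests.mean()

# Example: the geometric series sum_{k >= 0} 0.5**k = 2.
print(russian_roulette(lambda k: 0.5**k))      # close to 2.0

Unbiasedness follows because term k is kept with probability P(K >= k) = q**k and reweighted by its inverse; in SUMO the series terms are importance-sampling corrections to the log marginal likelihood.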
arXiv Detail & Related papers (2020-04-01T11:49:30Z)
- Asymptotic Analysis of Sampling Estimators for Randomized Numerical Linear Algebra Algorithms [43.134933182911766]
We develop an analysis to derive the distribution of RandNLA sampling estimators for the least-squares problem.
We identify optimal sampling probabilities based on the Asymptotic Mean Squared Error (AMSE) and the Expected Asymptotic Mean Squared Error (EAMSE).
Our theoretical results clarify the role of leverage in the sampling process, and our empirical results demonstrate improvements over existing methods.
arXiv Detail & Related papers (2020-02-24T20:34:50Z)