Optimal Sampling Gaps for Adaptive Submodular Maximization
- URL: http://arxiv.org/abs/2104.01750v1
- Date: Mon, 5 Apr 2021 03:21:32 GMT
- Title: Optimal Sampling Gaps for Adaptive Submodular Maximization
- Authors: Shaojie Tang, Jing Yuan
- Abstract summary: We study the performance loss caused by probability sampling in the context of adaptive submodular maximization.
We show that policywise submodularity can be found in a wide range of real-world applications.
- Score: 28.24164217929491
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Running machine learning algorithms on large and rapidly growing volumes of
data is often computationally expensive. One common trick to reduce the size
of a data set, and thus the computational cost of machine learning
algorithms, is \emph{probability sampling}. It creates a sampled data set by
including each data point from the original data set with a known probability.
Although the benefit of running machine learning algorithms on the reduced data
set is obvious, one major concern is that the performance of the solution
obtained from samples might be much worse than that of the optimal solution
when using the full data set. In this paper, we examine the performance loss
caused by probability sampling in the context of adaptive submodular
maximization. We consider the simplest probability sampling method, which selects
each data point independently with probability $r\in[0,1]$. We define the sampling
gap as the largest ratio between the optimal solution obtained from the full data
set and the optimal solution obtained from the samples, over independence
systems. Our main contribution is to show that if the utility function is
policywise submodular, then for a given sampling rate $r$, the sampling gap is
both upper bounded and lower bounded by $1/r$. One immediate implication of our
result is that if we can find an $\alpha$-approximation solution based on a
sampled data set (which is sampled at sampling rate $r$), then this solution
achieves an $\alpha r$ approximation ratio for the original problem when using
the full data set. We also show that policywise submodularity can
be found in a wide range of real-world applications, including pool-based
active learning and adaptive viral marketing.
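As a concrete illustration of the sampling step described in the abstract, the sketch below selects each element of a ground set independently with probability $r$; the function name and the toy ground set are illustrative, not from the paper.

```python
import random

def probability_sample(items, r, seed=None):
    """Keep each item independently with probability r (the sampling rate)."""
    rng = random.Random(seed)
    return [x for x in items if rng.random() < r]

# If an alpha-approximate policy is computed on the sample, the 1/r sampling
# gap stated above implies it is an (alpha * r)-approximation relative to the
# optimum over the full ground set.
ground_set = list(range(1000))      # illustrative ground set, not from the paper
sample = probability_sample(ground_set, r=0.2, seed=42)
print(f"kept {len(sample)} of {len(ground_set)} items")
```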
Related papers
- Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond [28.651041302245538]
We present a new data selection approach based on $k$-means clustering and sensitivity sampling.
We show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling.
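The paper's clustering-based sensitivity sampling is not reproduced here; the sketch below only shows the leverage score sampling baseline mentioned in the summary, under standard assumptions (rows sampled proportionally to their leverage scores, with inverse-probability rescaling).

```python
import numpy as np

def leverage_score_sample(X, y, s, rng=None):
    """Sample s rows of (X, y) proportionally to their leverage scores and
    rescale them so least squares on the sketch approximates the full solve."""
    rng = np.random.default_rng(rng)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    lev = (U ** 2).sum(axis=1)            # leverage score of each row
    p = lev / lev.sum()
    idx = rng.choice(len(X), size=s, replace=True, p=p)
    w = 1.0 / np.sqrt(s * p[idx])         # importance-sampling rescaling
    return X[idx] * w[:, None], y[idx] * w

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=5000)
Xs, ys = leverage_score_sample(X, y, s=500, rng=1)
beta = np.linalg.lstsq(Xs, ys, rcond=None)[0]   # approximate least-squares fit
```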
arXiv Detail & Related papers (2024-02-27T09:03:43Z)
- Optimize-via-Predict: Realizing out-of-sample optimality in data-driven optimization [0.0]
We examine a formulation for data-driven optimization wherein the decision-maker is not privy to the true distribution.
We define a prescriptive solution as a decision rule mapping such a data set to decisions.
We present an optimization problem that solves for such an out-of-sample optimal solution, and does so efficiently via a combination of sampling and bisection search algorithms.
arXiv Detail & Related papers (2023-09-20T08:48:50Z)
- Multisample Flow Matching: Straightening Flows with Minibatch Couplings [38.82598694134521]
Simulation-free methods for training continuous-time generative models construct probability paths that go between noise distributions and individual data samples.
We propose Multisample Flow Matching, a more general framework that uses non-trivial couplings between data and noise samples.
We show that our proposed methods improve sample consistency on downsampled ImageNet data sets, and lead to better low-cost sample generation.
arXiv Detail & Related papers (2023-04-28T11:33:08Z)
- Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning by generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
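AutoSMOTE's hierarchical reinforcement learning is not sketched here; the snippet below only illustrates the basic SMOTE-style interpolation that over-sampling methods of this kind build on (the function name and parameters are hypothetical).

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority-class points by interpolating a random
    point toward one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]    # k nearest neighbours per point
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = neighbours[i, rng.integers(k)]
        lam = rng.random()                        # interpolation coefficient
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_minority = np.random.default_rng(0).normal(size=(20, 3))
X_new = smote_like_oversample(X_minority, n_new=30, k=5, rng=1)
```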
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
- Adaptive Sketches for Robust Regression with Importance Sampling [64.75899469557272]
We introduce data structures for solving robust regression through stochastic gradient descent (SGD).
Our algorithm effectively runs $T$ steps of SGD with importance sampling while using sublinear space and just making a single pass over the data.
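The paper's sketch data structures are not reproduced; the snippet below only sketches the basic importance-sampling SGD update the summary refers to, for a least-squares objective with sampling probabilities proportional to squared row norms (an illustrative choice).

```python
import numpy as np

def sgd_importance_sampling(X, y, steps=2000, lr=0.05, rng=None):
    """Least-squares SGD where row i is drawn with probability p_i and its
    gradient is reweighted by 1/(n * p_i) to keep the update unbiased."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    p = np.linalg.norm(X, axis=1) ** 2
    p = p / p.sum()                            # illustrative sampling distribution
    w = np.zeros(d)
    for _ in range(steps):
        i = rng.choice(n, p=p)
        grad = (X[i] @ w - y[i]) * X[i]        # per-example gradient
        w -= lr * grad / (n * p[i])            # unbiased importance-weighted step
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=2000)
w_hat = sgd_importance_sampling(X, y, rng=1)
```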
arXiv Detail & Related papers (2022-07-16T03:09:30Z)
- Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z)
- Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method, by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
arXiv Detail & Related papers (2020-12-29T04:08:38Z)
- Sampling from a $k$-DPP without looking at all items [58.30573872035083]
Given a kernel function and a subset size $k$, our goal is to sample $k$ out of $n$ items with probability proportional to the determinant of the kernel matrix induced by the subset (a.k.a. $k$-DPP).
Existing $k$-DPP sampling algorithms require an expensive preprocessing step which involves multiple passes over all $n$ items, making it infeasible for large datasets.
We develop an algorithm which adaptively builds a sufficiently large uniform sample of data that is then used to efficiently generate a smaller set of $k$ items.
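The sublinear algorithm itself is not reproduced; the snippet below only illustrates the $k$-DPP definition by enumerating all size-$k$ subsets of a tiny ground set and sampling one with probability proportional to the determinant of the induced kernel submatrix.

```python
import itertools
import numpy as np

def naive_kdpp_sample(L, k, rng=None):
    """Sample a size-k subset S with probability proportional to det(L[S, S]).
    Enumerates all C(n, k) subsets, so it is only usable for tiny n."""
    rng = np.random.default_rng(rng)
    n = L.shape[0]
    subsets = list(itertools.combinations(range(n), k))
    weights = np.array([np.linalg.det(L[np.ix_(S, S)]) for S in subsets])
    probs = weights / weights.sum()
    return subsets[rng.choice(len(subsets), p=probs)]

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))
L = A @ A.T                       # a PSD kernel matrix over 8 items
print(naive_kdpp_sample(L, k=3, rng=1))
```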
arXiv Detail & Related papers (2020-06-30T16:40:44Z)
- Optimal Distributed Subsampling for Maximum Quasi-Likelihood Estimators with Massive Data [20.79270369203348]
Existing methods mostly focus on subsampling with replacement due to its high computational efficiency.
We first derive optimal subsampling probabilities in the context of quasi-likelihood estimation.
We develop a distributed subsampling framework, in which statistics are computed simultaneously on smaller partitions of the full data.
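The optimal quasi-likelihood subsampling probabilities and the distributed framework are not reproduced here; the sketch below only shows the generic subsample-with-replacement step with inverse-probability weighting, using an illustrative (not optimal) probability choice.

```python
import numpy as np

def weighted_subsample_fit(X, y, probs, m, rng=None):
    """Draw m rows with replacement according to `probs` and fit weighted least
    squares with inverse-probability weights, approximating the full-data fit."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X), size=m, replace=True, p=probs)
    w = 1.0 / (m * probs[idx])                 # inverse-probability weights
    Xw = X[idx] * np.sqrt(w)[:, None]
    yw = y[idx] * np.sqrt(w)
    return np.linalg.lstsq(Xw, yw, rcond=None)[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(size=100_000)
probs = np.linalg.norm(X, axis=1)
probs /= probs.sum()              # illustrative, not the paper's optimal probabilities
beta = weighted_subsample_fit(X, y, probs, m=2000, rng=1)
```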
arXiv Detail & Related papers (2020-05-21T02:46:56Z)
- Non-Adaptive Adaptive Sampling on Turnstile Streams [57.619901304728366]
We give the first relative-error algorithms for column subset selection, subspace approximation, projective clustering, and volume maximization on turnstile streams that use space sublinear in $n$.
Our adaptive sampling procedure has a number of applications to various data summarization problems that either improve state-of-the-art or have only been previously studied in the more relaxed row-arrival model.
arXiv Detail & Related papers (2020-04-23T05:00:21Z)