Sampling from a $k$-DPP without looking at all items
- URL: http://arxiv.org/abs/2006.16947v1
- Date: Tue, 30 Jun 2020 16:40:44 GMT
- Title: Sampling from a $k$-DPP without looking at all items
- Authors: Daniele Calandriello, Michał Dereziński, Michal Valko
- Abstract summary: Given a kernel function and a subset size $k$, our goal is to sample $k$ out of $n$ items with probability proportional to the determinant of the kernel matrix induced by the subset (a.k.a. $k$-DPP).
Existing $k$-DPP sampling algorithms require an expensive preprocessing step which involves multiple passes over all $n$ items, making it infeasible for large datasets.
We develop an algorithm which adaptively builds a sufficiently large uniform sample of data that is then used to efficiently generate a smaller set of $k$ items.
- Score: 58.30573872035083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Determinantal point processes (DPPs) are a useful probabilistic model for
selecting a small diverse subset out of a large collection of items, with
applications in summarization, stochastic optimization, active learning and
more. Given a kernel function and a subset size $k$, our goal is to sample $k$
out of $n$ items with probability proportional to the determinant of the kernel
matrix induced by the subset (a.k.a. $k$-DPP). Existing $k$-DPP sampling
algorithms require an expensive preprocessing step which involves multiple
passes over all $n$ items, making it infeasible for large datasets. A naïve
heuristic addressing this problem is to uniformly subsample a fraction of the
data and perform $k$-DPP sampling only on those items; however, this method
offers no guarantee that the produced sample will even approximately resemble
the target distribution over the original dataset. In this paper, we develop an
algorithm which adaptively builds a sufficiently large uniform sample of data
that is then used to efficiently generate a smaller set of $k$ items, while
ensuring that this set is drawn exactly from the target distribution defined on
all $n$ items. We show empirically that our algorithm produces a $k$-DPP sample
after observing only a small fraction of all elements, leading to several
orders of magnitude faster performance compared to the state-of-the-art.
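
To make the target distribution concrete, here is a minimal brute-force $k$-DPP sampler (my own sketch, not the paper's adaptive algorithm): it enumerates all $\binom{n}{k}$ subsets, which is exactly the kind of cost the paper's method is designed to avoid, so it is only viable for tiny $n$.

```python
# Brute-force k-DPP sampler: enumerates all C(n, k) subsets, so it is only
# feasible for tiny n. It illustrates the target distribution P(S) ∝ det(L_S);
# it is NOT the paper's adaptive algorithm, which avoids touching all n items.
from itertools import combinations

import numpy as np


def sample_k_dpp_bruteforce(L, k, rng=None):
    """Sample a size-k subset S with probability proportional to det(L[S, S])."""
    rng = rng or np.random.default_rng()
    subsets = list(combinations(range(L.shape[0]), k))
    # L is PSD, so every principal minor det(L_S) is a valid nonnegative weight.
    weights = np.array([np.linalg.det(L[np.ix_(S, S)]) for S in subsets])
    return subsets[rng.choice(len(subsets), p=weights / weights.sum())]


# Toy example: RBF kernel over 8 points on a line; spread-out (diverse)
# triples have larger determinants and are therefore sampled more often.
x = np.linspace(0, 1, 8)
L = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.1)
print(sample_k_dpp_bruteforce(L, k=3))
```
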
Related papers
- Replicable Clustering [57.19013971737493]
We propose algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their counterparts in a black-box manner.
We also provide experiments on synthetic distributions in 2D that validate our theoretical results, using the $k$-means++ implementation from sklearn as a black box.
arXiv Detail & Related papers (2023-02-20T23:29:43Z)
- A Faster Sampler for Discrete Determinantal Point Processes [10.355938901584567]
Discrete Determinantal Point Processes (DPPs) have a wide array of potential applications for subsampling datasets.
In the worst-case scenario, the sampling cost scales as $O(n^3)$, where $n$ is the number of elements of the ground set.
A popular workaround to this prohibitive cost is to sample DPPs defined by low-rank kernels.
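
As a concrete illustration of the low-rank workaround (a sketch under the standard factorization $L = BB^\top$, not this paper's specific sampler): any $k \times k$ principal submatrix can be formed from just $k$ rows of the factor $B$, so subset determinants never require building the full $n \times n$ kernel.

```python
# Low-rank kernel sketch: with a rank-r factorization L = B @ B.T, the
# submatrix L_S = B_S @ B_S.T for a size-k subset S costs O(k^2 r) to form,
# so the n x n kernel matrix L itself is never materialized.
import numpy as np

rng = np.random.default_rng(0)
n, r, k = 10_000, 20, 5
B = rng.standard_normal((n, r))           # low-rank factor; L = B @ B.T stays implicit

S = rng.choice(n, size=k, replace=False)  # an arbitrary size-k subset
L_S = B[S] @ B[S].T                       # k x k, built from only k rows of B
print(np.linalg.det(L_S))                 # unnormalized k-DPP weight of S
```
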
arXiv Detail & Related papers (2022-10-31T14:37:54Z)
- Scalable MCMC Sampling for Nonsymmetric Determinantal Point Processes [45.40646054226403]
A determinantal point process (DPP) assigns a probability to every subset of $n$ items.
Recent work has studied an approximate Markov chain Monte Carlo (MCMC) sampling algorithm for nonsymmetric DPPs (NDPPs) restricted to size-$k$ subsets.
We develop a scalable MCMC sampling algorithm for $k$-NDPPs with low-rank kernels.
arXiv Detail & Related papers (2022-07-01T15:22:12Z)
- Leverage Score Sampling for Tensor Product Matrices in Input Sparsity Time [54.65688986250061]
We give an input sparsity time sampling algorithm for approximating the Gram matrix corresponding to the $q$-fold column-wise tensor product of $q$ matrices.
Our sampling technique relies on a collection of $q$ partially correlated random projections which can be simultaneously applied to a dataset $X$ in input sparsity time.
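
A small illustration of why column-wise tensor (Khatri-Rao) products admit fast Gram computations (my own sketch under the standard definition, not this paper's leverage-score algorithm): the Gram matrix of the product equals the entrywise (Hadamard) product of the individual Gram matrices, shown here for $q = 2$.

```python
# Gram matrix of a column-wise tensor (Khatri-Rao) product without ever
# materializing the tall product matrix: Gram(KR(A, B)) = (A.T A) * (B.T B),
# where * is the entrywise (Hadamard) product. Demonstrated for q = 2.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
A, B = rng.standard_normal((n, d)), rng.standard_normal((n, d))

# Column-wise tensor product: column j is kron(A[:, j], B[:, j]); shape (n*n, d).
KR = np.stack([np.kron(A[:, j], B[:, j]) for j in range(d)], axis=1)

gram_direct = KR.T @ KR                    # needs the (n^2 x d) matrix explicitly
gram_hadamard = (A.T @ A) * (B.T @ B)      # only d x d Grams, no tall matrix
print(np.allclose(gram_direct, gram_hadamard))  # True
```
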
arXiv Detail & Related papers (2022-02-09T15:26:03Z)
- Scalable Sampling for Nonsymmetric Determinantal Point Processes [47.18450267217498]
A determinantal point process (DPP) on a collection of $M$ items is a model parameterized by a symmetric kernel matrix.
Recent work shows that removing the kernel symmetry constraint, yielding nonsymmetric DPPs (NDPPs), can lead to significant predictive performance gains for machine learning applications.
There is only one known DPP sampling algorithm, based on Cholesky decomposition, that also applies directly to NDPPs.
We show that imposing certain structural constraints on the NDPP kernel enables us to bound the rejection rate in a way that depends only on the kernel rank.
arXiv Detail & Related papers (2022-01-20T19:21:59Z)
- Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z)
- Testing Determinantal Point Processes [0.0]
We investigate DPPs from a new perspective: property testing of distributions.
We propose the first algorithm for testing DPPs.
We show a new hardness result for the problem of testing the more general class of log-submodular distributions.
arXiv Detail & Related papers (2020-08-09T04:45:16Z)
- A Provably Efficient Sample Collection Strategy for Reinforcement Learning [123.69175280309226]
One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment and the exploitation of the samples to optimize its behavior.
We propose to tackle the exploration-exploitation problem with a decoupled approach composed of: 1) an "objective-specific" algorithm that prescribes how many samples to collect at which states, as if it had access to a generative model (i.e., a sparse simulator of the environment); 2) an "objective-agnostic" sample collection strategy responsible for generating the prescribed samples as fast as possible.
arXiv Detail & Related papers (2020-07-13T15:17:35Z)
- Best-item Learning in Random Utility Models with Subset Choices [40.17224226373741]
We consider the problem of PAC learning the most valuable item from a pool of $n$ items using sequential, adaptively chosen plays of subsets of $k$ items.
We identify a new property of such a random utility model (RUM), termed the minimum advantage, that helps characterize the complexity of separating pairs of items.
We give a learning algorithm for general RUMs, based on pairwise relative counts of items and hierarchical elimination, along with a new PAC sample complexity guarantee.
arXiv Detail & Related papers (2020-02-19T03:57:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.