Sampling from a $k$-DPP without looking at all items
- URL: http://arxiv.org/abs/2006.16947v1
- Date: Tue, 30 Jun 2020 16:40:44 GMT
- Title: Sampling from a $k$-DPP without looking at all items
- Authors: Daniele Calandriello, Michał Dereziński, Michal Valko
- Abstract summary: Given a kernel function and a subset size $k$, our goal is to sample $k$ out of $n$ items with probability proportional to the determinant of the kernel matrix induced by the subset (a.k.a. $k$-DPP).
Existing $k$-DPP sampling algorithms require an expensive preprocessing step which involves multiple passes over all $n$ items, making it infeasible for large datasets.
We develop an algorithm which adaptively builds a sufficiently large uniform sample of data that is then used to efficiently generate a smaller set of $k$ items.
- Score: 58.30573872035083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Determinantal point processes (DPPs) are a useful probabilistic model for
selecting a small diverse subset out of a large collection of items, with
applications in summarization, stochastic optimization, active learning and
more. Given a kernel function and a subset size $k$, our goal is to sample $k$
out of $n$ items with probability proportional to the determinant of the kernel
matrix induced by the subset (a.k.a. $k$-DPP). Existing $k$-DPP sampling
algorithms require an expensive preprocessing step which involves multiple
passes over all $n$ items, making it infeasible for large datasets. A naïve
heuristic addressing this problem is to uniformly subsample a fraction of the
data and perform $k$-DPP sampling only on those items; however, this method
offers no guarantee that the produced sample will even approximately resemble
the target distribution over the original dataset. In this paper, we develop an
algorithm which adaptively builds a sufficiently large uniform sample of data
that is then used to efficiently generate a smaller set of $k$ items, while
ensuring that this set is drawn exactly from the target distribution defined on
all $n$ items. We show empirically that our algorithm produces a $k$-DPP sample
after observing only a small fraction of all elements, leading to several
orders of magnitude faster performance compared to the state-of-the-art.
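
To make the target distribution concrete, here is a minimal brute-force $k$-DPP sampler (my own sketch, not the paper's adaptive algorithm): it enumerates all $\binom{n}{k}$ subsets, which is exactly the kind of cost the paper's method is designed to avoid, so it is only viable for tiny $n$.

```python
# Brute-force k-DPP sampler: enumerates all C(n, k) subsets, so it is only
# feasible for tiny n. It illustrates the target distribution P(S) ∝ det(L_S);
# it is NOT the paper's adaptive algorithm, which avoids touching all n items.
from itertools import combinations

import numpy as np


def sample_k_dpp_bruteforce(L, k, rng=None):
    """Sample a size-k subset S with probability proportional to det(L[S, S])."""
    rng = rng or np.random.default_rng()
    subsets = list(combinations(range(L.shape[0]), k))
    # L is PSD, so every principal minor det(L_S) is a valid nonnegative weight.
    weights = np.array([np.linalg.det(L[np.ix_(S, S)]) for S in subsets])
    return subsets[rng.choice(len(subsets), p=weights / weights.sum())]


# Toy example: RBF kernel over 8 points on a line; spread-out (diverse)
# triples have larger determinants and are therefore sampled more often.
x = np.linspace(0, 1, 8)
L = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.1)
print(sample_k_dpp_bruteforce(L, k=3))
```
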
Related papers
- Replicable Clustering [57.19013971737493]
We propose algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their counterparts in a black-box manner.
We also provide experiments on synthetic distributions in 2D that validate our theoretical results, using the $k$-means++ implementation from sklearn as a black box.
arXiv Detail & Related papers (2023-02-20T23:29:43Z)
- A Faster Sampler for Discrete Determinantal Point Processes [10.355938901584567]
Discrete Determinantal Point Processes (DPPs) have a wide array of potential applications for subsampling datasets.
In the worst-case scenario, the sampling cost scales as $O(n^3)$, where $n$ is the number of elements of the ground set.
A popular workaround to this prohibitive cost is to sample DPPs defined by low-rank kernels.
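
As a concrete illustration of the low-rank workaround (a sketch under the standard factorization $L = BB^\top$, not this paper's specific sampler): any $k \times k$ principal submatrix can be formed from just $k$ rows of the factor $B$, so subset determinants never require building the full $n \times n$ kernel.

```python
# Low-rank kernel sketch: with a rank-r factorization L = B @ B.T, the
# submatrix L_S = B_S @ B_S.T for a size-k subset S costs O(k^2 r) to form,
# so the n x n kernel matrix L itself is never materialized.
import numpy as np

rng = np.random.default_rng(0)
n, r, k = 10_000, 20, 5
B = rng.standard_normal((n, r))           # low-rank factor; L = B @ B.T stays implicit

S = rng.choice(n, size=k, replace=False)  # an arbitrary size-k subset
L_S = B[S] @ B[S].T                       # k x k, built from only k rows of B
print(np.linalg.det(L_S))                 # unnormalized k-DPP weight of S
```
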
arXiv Detail & Related papers (2022-10-31T14:37:54Z)
- Scalable MCMC Sampling for Nonsymmetric Determinantal Point Processes [45.40646054226403]
A determinantal point process (DPP) assigns a probability to every subset of $n$ items.
Recent work has studied an approximate Markov chain Monte Carlo (MCMC) sampling algorithm for nonsymmetric DPPs (NDPPs) restricted to size-$k$ subsets.
We develop a scalable MCMC sampling algorithm for $k$-NDPPs with low-rank kernels.
arXiv Detail & Related papers (2022-07-01T15:22:12Z)
- Leverage Score Sampling for Tensor Product Matrices in Input Sparsity Time [54.65688986250061]
We give an input sparsity time sampling algorithm for approximating the Gram matrix corresponding to the $q$-fold column-wise tensor product of $q$ matrices.
Our sampling technique relies on a collection of $q$ partially correlated random projections which can be simultaneously applied to a dataset $X$ in input sparsity time.
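
A small illustration of why column-wise tensor (Khatri-Rao) products admit fast Gram computations (my own sketch under the standard definition, not this paper's leverage-score algorithm): the Gram matrix of the product equals the entrywise (Hadamard) product of the individual Gram matrices, shown here for $q = 2$.

```python
# Gram matrix of a column-wise tensor (Khatri-Rao) product without ever
# materializing the tall product matrix: Gram(KR(A, B)) = (A.T A) * (B.T B),
# where * is the entrywise (Hadamard) product. Demonstrated for q = 2.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
A, B = rng.standard_normal((n, d)), rng.standard_normal((n, d))

# Column-wise tensor product: column j is kron(A[:, j], B[:, j]); shape (n*n, d).
KR = np.stack([np.kron(A[:, j], B[:, j]) for j in range(d)], axis=1)

gram_direct = KR.T @ KR                    # needs the (n^2 x d) matrix explicitly
gram_hadamard = (A.T @ A) * (B.T @ B)      # only d x d Grams, no tall matrix
print(np.allclose(gram_direct, gram_hadamard))  # True
```
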
arXiv Detail & Related papers (2022-02-09T15:26:03Z)
- Scalable Sampling for Nonsymmetric Determinantal Point Processes [47.18450267217498]
A determinantal point process (DPP) on a collection of $M$ items is a model parameterized by a symmetric kernel matrix.
Recent work shows that removing the kernel symmetry constraint, yielding nonsymmetric DPPs (NDPPs), can lead to significant predictive performance gains for machine learning applications.
There is only one known DPP sampling algorithm, based on Cholesky decomposition, that also applies directly to NDPPs.
We show that imposing certain structural constraints on the NDPP kernel enables us to bound the rejection rate in a way that depends only on the kernel rank.
arXiv Detail & Related papers (2022-01-20T19:21:59Z)
- Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z)
- Testing Determinantal Point Processes [0.0]
We investigate DPPs from a new perspective: property testing of distributions.
We propose the first algorithm for testing DPPs.
We show a new hardness result for the problem of testing the more general class of log-submodular distributions.
arXiv Detail & Related papers (2020-08-09T04:45:16Z)
- A Provably Efficient Sample Collection Strategy for Reinforcement Learning [123.69175280309226]
One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment and the exploitation of the samples to optimize its behavior.
We propose to tackle the exploration-exploitation problem with a decoupled approach composed of: 1) an "objective-specific" algorithm that prescribes how many samples to collect at which states, as if it had access to a generative model (i.e., a sparse simulator of the environment); 2) an "objective-agnostic" sample collection strategy responsible for generating the prescribed samples as fast as possible.
arXiv Detail & Related papers (2020-07-13T15:17:35Z)
- Best-item Learning in Random Utility Models with Subset Choices [40.17224226373741]
We consider the problem of PAC learning the most valuable item from a pool of $n$ items using sequential, adaptively chosen plays of subsets of $k$ items.
We identify a new property of such a random utility model (RUM), termed the minimum advantage, that helps characterize the complexity of separating pairs of items.
We give a learning algorithm for general RUMs, based on pairwise relative counts of items and hierarchical elimination, along with a new PAC sample complexity guarantee.
arXiv Detail & Related papers (2020-02-19T03:57:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.