On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions
- URL: http://arxiv.org/abs/2402.16442v1
- Date: Mon, 26 Feb 2024 09:38:39 GMT
- Title: On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions
- Authors: Maximilian Böther, Abraham Sebastian, Pranjal Awasthi, Ana Klimovic, Srikumar Ramalingam
- Abstract summary: Submodularity, a discrete analogue of convexity, is commonly used for solving subset selection problems.
In this paper, we propose a novel distributed bounding algorithm with provable approximation guarantees.
We show that these algorithms find high quality subsets on CIFAR-100 and ImageNet with marginal or no loss in quality compared to centralized methods.
- Score: 32.09166873008287
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many learning problems hinge on the fundamental problem of subset selection,
i.e., identifying a subset of important and representative points. For example,
selecting the most significant samples in ML training can not only reduce
training costs but also enhance model quality. Submodularity, a discrete
analogue of convexity, is commonly used for solving subset selection problems.
However, existing algorithms for optimizing submodular functions are
sequential, and the prior distributed methods require at least one central
machine to fit the target subset. In this paper, we relax the requirement of
having a central machine for the target subset by proposing a novel distributed
bounding algorithm with provable approximation guarantees. The algorithm
iteratively bounds the minimum and maximum utility values to select high
quality points and discard the unimportant ones. When bounding does not find
the complete subset, we use a multi-round, partition-based distributed greedy
algorithm to identify the remaining subset. We show that these algorithms find
high quality subsets on CIFAR-100 and ImageNet with marginal or no loss in
quality compared to centralized methods, and scale to a dataset with 13 billion
points.
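To make the two-phase scheme above concrete, here is a minimal Python sketch. It is our own illustrative simplification, not the authors' algorithm: it uses facility location, f(S) = sum_i max_{j in S} sim(i, j), as the pairwise submodular objective, discards points with a crude singleton-utility bound (valid for any monotone submodular f, since diminishing returns makes each singleton utility an upper bound on all later marginal gains), and falls back to a GreeDi-style partition-based greedy. All function names, the 0.5 discard factor, and the partitioning are hypothetical choices for illustration.
```python
import numpy as np

def facility_location(sim, subset):
    # Pairwise submodular objective: f(S) = sum_i max_{j in S} sim(i, j).
    # Monotone and submodular whenever sim is nonnegative.
    if not subset:
        return 0.0
    return float(sim[:, subset].max(axis=1).sum())

def greedy(sim, candidates, k):
    # Standard greedy maximization: repeatedly add the candidate with the
    # largest marginal gain under the facility-location objective.
    selected, cover = [], np.zeros(sim.shape[0])
    for _ in range(min(k, len(candidates))):
        gains = [(np.maximum(cover, sim[:, c]).sum() - cover.sum(), c)
                 for c in candidates if c not in selected]
        _, best = max(gains)
        selected.append(best)
        cover = np.maximum(cover, sim[:, best])
    return selected

def distributed_select(sim, k, num_machines=4, seed=0):
    n = sim.shape[0]
    # Phase 1 (bounding, simplified): by diminishing returns, a point's
    # marginal gain never exceeds its singleton utility f({j}), so points
    # whose singleton utility falls far below the k-th largest can be
    # discarded without ever being shipped to one machine. The 0.5 factor
    # is an arbitrary toy rule, not from the paper.
    singleton = sim.sum(axis=0)
    kth_largest = np.partition(singleton, -k)[-k]
    undecided = [j for j in range(n) if singleton[j] >= 0.5 * kth_largest]
    # Phase 2 (partition-based greedy, GreeDi-style): shuffle the surviving
    # points across machines, run greedy locally, then greedy over the union.
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(undecided), num_machines)
    local_picks = [c for part in parts for c in greedy(sim, list(part), k)]
    return greedy(sim, local_picks, k)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))
    sim = X @ X.T
    sim -= sim.min()  # shift similarities to be nonnegative
    subset = distributed_select(sim, k=10)
    print(subset, facility_location(sim, subset))
```
In this toy version the bounding phase only prunes, whereas the paper's algorithm also uses lower bounds to accept points outright; points that are neither accepted nor discarded are exactly what the multi-round partitioned greedy then resolves.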
Related papers
- Training Greedy Policy for Proposal Batch Selection in Expensive Multi-Objective Combinatorial Optimization [52.80408805368928]
We introduce a novel greedy-style subset selection algorithm for batch acquisition.
Our experiments on red fluorescent proteins show that the proposed method matches baseline performance with 1.69x fewer queries.
arXiv Detail & Related papers (2024-06-21T05:57:08Z)
- Multi-objective Binary Coordinate Search for Feature Selection [0.24578723416255746]
We propose the binary multi-objective coordinate search (MOCS) algorithm to solve large-scale feature selection problems.
Results indicate the significant superiority of our method over NSGA-II, on five real-world large-scale datasets.
arXiv Detail & Related papers (2024-02-20T00:50:26Z)
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- An Exact Algorithm for Semi-supervised Minimum Sum-of-Squares Clustering [0.5801044612920815]
We present a new branch-and-bound algorithm for semi-supervised MSSC.
Background knowledge is incorporated as pairwise must-link and cannot-link constraints.
The proposed global optimization algorithm is the first to efficiently solve real-world instances with up to 800 data points.
arXiv Detail & Related papers (2021-11-30T17:08:53Z)
- Distributed stochastic proximal algorithm with random reshuffling for non-smooth finite-sum optimization [28.862321453597918]
Non-smooth finite-sum minimization is a fundamental problem in machine learning.
This paper develops a distributed proximal-gradient algorithm with random reshuffling to solve the problem.
arXiv Detail & Related papers (2021-11-06T07:29:55Z)
- Parallel Stochastic Mirror Descent for MDPs [72.75921150912556]
We consider the problem of learning the optimal policy for infinite-horizon Markov decision processes (MDPs).
A variant of Mirror Descent is proposed for convex programming problems with Lipschitz-continuous functionals.
We analyze this algorithm in a general case and obtain an estimate of the convergence rate that does not accumulate errors during the operation of the method.
arXiv Detail & Related papers (2021-02-27T19:28:39Z)
- Determinantal consensus clustering [77.34726150561087]
We propose the use of determinantal point processes or DPP for the random restart of clustering algorithms.
DPPs favor diversity of the center points within subsets.
We show through simulations that, contrary to DPP, uniform random restarts fail both to ensure diversity and to obtain good coverage of all data facets.
arXiv Detail & Related papers (2021-02-07T23:48:24Z)
- Fast Greedy Subset Selection from Large Candidate Solution Sets in Evolutionary Multi-objective Optimization [11.110675371854988]
We discuss the efficiency of greedy subset selection for the hypervolume, IGD and IGD+ indicators.
Our idea is to use the submodular property, which is known for the hypervolume indicator, to improve their efficiency; a generic lazy-greedy sketch of this idea appears after this list.
arXiv Detail & Related papers (2021-02-01T16:14:15Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- Non-Adaptive Adaptive Sampling on Turnstile Streams [57.619901304728366]
We give the first relative-error algorithms for column subset selection, subspace approximation, projective clustering, and volume on turnstile streams that use space sublinear in $n$.
Our adaptive sampling procedure has a number of applications to various data summarization problems that either improve state-of-the-art or have only been previously studied in the more relaxed row-arrival model.
arXiv Detail & Related papers (2020-04-23T05:00:21Z)
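The efficiency idea flagged in the greedy subset selection entry above, exploiting submodularity to skip gain re-evaluations, is commonly realized as lazy greedy. The sketch below is a generic illustration under that assumption, not code from any listed paper; it accepts any monotone submodular set function f.
```python
import heapq

def lazy_greedy(f, ground_set, k):
    # Diminishing returns: a point's marginal gain only shrinks as the subset
    # grows, so stale gains in a max-heap remain valid upper bounds and most
    # re-evaluations can be skipped.
    selected, f_sel = [], f([])
    heap = [(-(f([x]) - f_sel), x) for x in ground_set]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        _, x = heapq.heappop(heap)
        gain = f(selected + [x]) - f_sel          # recompute a fresh gain
        if not heap or gain >= -heap[0][0]:       # beats the best stale bound
            selected.append(x)
            f_sel += gain
        else:
            heapq.heappush(heap, (-gain, x))      # refresh and reinsert
    return selected

if __name__ == "__main__":
    # Toy coverage objective: f(S) = size of the union of the sets indexed by S.
    sets = {0: {1, 2, 3}, 1: {3, 4}, 2: {5}, 3: {1, 5, 6, 7}}
    f = lambda S: len(set().union(*(sets[i] for i in S))) if S else 0
    print(lazy_greedy(f, list(sets), k=2))        # -> [3, 0]
```
Lazy greedy returns exactly the same subset as the plain greedy above it accelerates; only the number of objective evaluations changes.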
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.