Core-sets for Fair and Diverse Data Summarization
- URL: http://arxiv.org/abs/2310.08122v1
- Date: Thu, 12 Oct 2023 08:24:02 GMT
- Title: Core-sets for Fair and Diverse Data Summarization
- Authors: Sepideh Mahabadi and Stojan Trajanovski
- Abstract summary: We study core-set construction algorithms for the task of Diversity Maximization under fairness/ partition constraint.
We show the first constant factor core-set w.r.t. sum-of-pairwise distances whose size is independent of the size of the dataset and the aspect ratio.
In particular, we apply constrained diversity to summarize a set of timed messages that takes into account the messages' recency.
- Score: 5.424775372322808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study core-set construction algorithms for the task of Diversity
Maximization under fairness/partition constraint. Given a set of points $P$ in
a metric space partitioned into $m$ groups, and given $k_1,\ldots,k_m$, the
goal of this problem is to pick $k_i$ points from each group $i$ such that the
overall diversity of the $k=\sum_i k_i$ picked points is maximized. We consider
two natural diversity measures: sum-of-pairwise distances and
sum-of-nearest-neighbor distances, and show improved core-set construction
algorithms with respect to these measures. More precisely, we show the first
constant factor core-set w.r.t. sum-of-pairwise distances whose size is
independent of the size of the dataset and the aspect ratio. Second, we show
the first core-set w.r.t. the sum-of-nearest-neighbor distances. Finally, we
run several experiments showing the effectiveness of our core-set approach. In
particular, we apply constrained diversity maximization to summarize a set of
timed messages that takes into account the messages' recency. Specifically, the
summary should include more recent messages compared to older ones. This is a
real task in one of the largest communication platforms, affecting the
experience of hundreds of millions daily active users. By utilizing our
core-set method for this task, we achieve a 100x speed-up while losing the
diversity by only a few percent. Moreover, our approach allows us to improve
the space usage of the algorithm in the streaming setting.
Related papers
- Fair Clustering for Data Summarization: Improved Approximation Algorithms and Complexity Insights [16.120911591795295]
In some applications all data points can be chosen as centers, in the general setting, centers must be chosen from a subset of points, referred as facilities or suppliers.
In this work, we focus on fair data summarization modeled as the fair $k$-supplier problem, where data consists of several groups, and a minimum number of centers must be selected from each group.
arXiv Detail & Related papers (2024-10-16T18:00:19Z) - A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z) - Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novelgreedy bandit (SGB) algorithm for multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $tin [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z) - Composable Coresets for Determinant Maximization: Greedy is Almost
Optimal [2.849878266188877]
Given a set of $n$ in $mathbbRd$, the goal of the emphdeterminant problem is to pick $k$ vectors with the maximum volume.
We show that the widely-used Greedy algorithm also provides composable coresets with an almost optimal approximation factor of $O(k)3k$.
arXiv Detail & Related papers (2023-09-26T21:46:44Z) - Computing Star Discrepancies with Numerical Black-Box Optimization
Algorithms [56.08144272945755]
We compare 8 popular numerical black-box optimization algorithms on the $L_infty$ star discrepancy problem.
We show that all used solvers perform very badly on a large majority of the instances.
We suspect that state-of-the-art numerical black-box optimization techniques fail to capture the global structure of the problem.
arXiv Detail & Related papers (2023-06-29T14:57:56Z) - Max-Min Diversification with Fairness Constraints: Exact and
Approximation Algorithms [17.57585822765145]
We propose an exact algorithm that is suitable for small datasets as well as a $frac1-varepsilon integer integer5$-approximation algorithm for any $varepsilon in (0, 1)$ that scales to large datasets.
Experiments on real-world datasets demonstrate the superior performance of our proposed algorithms over existing ones.
arXiv Detail & Related papers (2023-01-05T13:02:35Z) - Robust Multi-Object Tracking by Marginal Inference [92.48078680697311]
Multi-object tracking in videos requires to solve a fundamental problem of one-to-one assignment between objects in adjacent frames.
We present an efficient approach to compute a marginal probability for each pair of objects in real time.
It achieves competitive results on MOT17 and MOT20 benchmarks.
arXiv Detail & Related papers (2022-08-07T14:04:45Z) - Streaming Algorithms for Diversity Maximization with Fairness
Constraints [4.53279507109072]
A streaming algorithm should process $X$ sequentially in one pass and a return subset with maximum emph orders while guaranteeing the fairness constraint.
Since diversity is NP-hard in general, we propose two algorithms for fair diversity in data streams.
Experimental results show that both algorithms provide solutions of comparable quality to the state-of-the-art algorithms.
arXiv Detail & Related papers (2022-07-30T11:47:31Z) - Introduction to Core-sets: an Updated Survey [18.059360820527687]
In machine learning problems, the goal is to minimize or maximize an objective function over some space of candidate solutions.
Traditional algorithms cannot handle modern systems that require parallel real-time computations of infinite distributed streams.
This survey summarizes such constructions in a retrospective way, that aims to unified and simplify the state-of-the-art.
arXiv Detail & Related papers (2020-11-18T16:31:34Z) - Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z) - Revisiting Modified Greedy Algorithm for Monotone Submodular
Maximization with a Knapsack Constraint [75.85952446237599]
We show that a modified greedy algorithm can achieve an approximation factor of $0.305$.
We derive a data-dependent upper bound on the optimum.
It can also be used to significantly improve the efficiency of such algorithms as branch and bound.
arXiv Detail & Related papers (2020-08-12T15:40:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.