Related papers: Core-sets for Fair and Diverse Data Summarization

Core-sets for Fair and Diverse Data Summarization

URL: http://arxiv.org/abs/2310.08122v1
Date: Thu, 12 Oct 2023 08:24:02 GMT
Title: Core-sets for Fair and Diverse Data Summarization
Authors: Sepideh Mahabadi and Stojan Trajanovski
Abstract summary: We study core-set construction algorithms for the task of Diversity Maximization under fairness/ partition constraint. We show the first constant factor core-set w.r.t. sum-of-pairwise distances whose size is independent of the size of the dataset and the aspect ratio. In particular, we apply constrained diversity to summarize a set of timed messages that takes into account the messages' recency.
Score: 5.424775372322808
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study core-set construction algorithms for the task of Diversity Maximization under fairness/partition constraint. Given a set of points $P$ in a metric space partitioned into $m$ groups, and given $k_1,\ldots,k_m$, the goal of this problem is to pick $k_i$ points from each group $i$ such that the overall diversity of the $k=\sum_i k_i$ picked points is maximized. We consider two natural diversity measures: sum-of-pairwise distances and sum-of-nearest-neighbor distances, and show improved core-set construction algorithms with respect to these measures. More precisely, we show the first constant factor core-set w.r.t. sum-of-pairwise distances whose size is independent of the size of the dataset and the aspect ratio. Second, we show the first core-set w.r.t. the sum-of-nearest-neighbor distances. Finally, we run several experiments showing the effectiveness of our core-set approach. In particular, we apply constrained diversity maximization to summarize a set of timed messages that takes into account the messages' recency. Specifically, the summary should include more recent messages compared to older ones. This is a real task in one of the largest communication platforms, affecting the experience of hundreds of millions daily active users. By utilizing our core-set method for this task, we achieve a 100x speed-up while losing the diversity by only a few percent. Moreover, our approach allows us to improve the space usage of the algorithm in the streaming setting.

Related papers

Better Rates for Random Task Orderings in Continual Linear Models [50.11453013647086]
We analyze the forgetting, i.e., loss on previously seen tasks, after $k$ iterations. We develop novel last-iterate bounds in the realizable least squares setup, and apply them to derive new results for continual learning. We prove for the first time that randomization alone, with no task repetition, can prevent catastrophic forgetting in sufficiently long task.
arXiv Detail & Related papers (2025-04-06T18:39:45Z)
Scalable Private Partition Selection via Adaptive Weighting [66.09199304818928]
In a private set union, users hold subsets of items from an unbounded universe. The goal is to output as many items as possible from the union of the users' sets while maintaining user-level differential privacy. We propose an algorithm for this problem, MaximumDegree (MAD), which adaptively reroutes weight from items with weight far above the threshold needed for privacy to items with smaller weight.
arXiv Detail & Related papers (2025-02-13T01:27:11Z)
Near-Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Approximation and Communication Efficiency [52.60557300927007]
We present a $textbfMA-OSMA$ algorithm to transfer the discrete submodular problem into a continuous optimization.<n>We also introduce a projection-free $textbfMA-OSEA$ algorithm, which effectively utilizes the KL divergence by mixing a uniform distribution.<n>Our algorithms significantly improve the $(frac11+c)$-approximation provided by the state-of-the-art OSG algorithm.
arXiv Detail & Related papers (2025-02-07T15:57:56Z)
Fair Clustering for Data Summarization: Improved Approximation Algorithms and Complexity Insights [16.120911591795295]
In some applications all data points can be chosen as centers, in the general setting, centers must be chosen from a subset of points, referred as facilities or suppliers. In this work, we focus on fair data summarization modeled as the fair $k$-supplier problem, where data consists of several groups, and a minimum number of centers must be selected from each group.
arXiv Detail & Related papers (2024-10-16T18:00:19Z)
GIST: Greedy Independent Set Thresholding for Diverse Data Summarization [21.69260104523751]
We introduce a novel subset selection problem called min-distance diversification with monotone submodular utility ($textsfMDMS$) Given a set of points in a metric space, the goal of $textsfMDMS$ is to maximize an objective function combining a monotone submodular utility term and a min-distance diversity term between any pair of selected points.
arXiv Detail & Related papers (2024-05-29T04:39:24Z)
A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data. We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novelgreedy bandit (SGB) algorithm for multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $tin [T]$ is observed. SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z)
Composable Coresets for Determinant Maximization: Greedy is Almost Optimal [2.849878266188877]
Given a set of $n$ in $mathbbRd$, the goal of the emphdeterminant problem is to pick $k$ vectors with the maximum volume. We show that the widely-used Greedy algorithm also provides composable coresets with an almost optimal approximation factor of $O(k)3k$.
arXiv Detail & Related papers (2023-09-26T21:46:44Z)
Computing Star Discrepancies with Numerical Black-Box Optimization Algorithms [56.08144272945755]
We compare 8 popular numerical black-box optimization algorithms on the $L_infty$ star discrepancy problem. We show that all used solvers perform very badly on a large majority of the instances. We suspect that state-of-the-art numerical black-box optimization techniques fail to capture the global structure of the problem.
arXiv Detail & Related papers (2023-06-29T14:57:56Z)
Max-Min Diversification with Fairness Constraints: Exact and Approximation Algorithms [17.57585822765145]
We propose an exact algorithm that is suitable for small datasets as well as a $frac1-varepsilon integer integer5$-approximation algorithm for any $varepsilon in (0, 1)$ that scales to large datasets. Experiments on real-world datasets demonstrate the superior performance of our proposed algorithms over existing ones.
arXiv Detail & Related papers (2023-01-05T13:02:35Z)
Robust Multi-Object Tracking by Marginal Inference [92.48078680697311]
Multi-object tracking in videos requires to solve a fundamental problem of one-to-one assignment between objects in adjacent frames. We present an efficient approach to compute a marginal probability for each pair of objects in real time. It achieves competitive results on MOT17 and MOT20 benchmarks.
arXiv Detail & Related papers (2022-08-07T14:04:45Z)
Streaming Algorithms for Diversity Maximization with Fairness Constraints [4.53279507109072]
A streaming algorithm should process $X$ sequentially in one pass and a return subset with maximum emph orders while guaranteeing the fairness constraint. Since diversity is NP-hard in general, we propose two algorithms for fair diversity in data streams. Experimental results show that both algorithms provide solutions of comparable quality to the state-of-the-art algorithms.
arXiv Detail & Related papers (2022-07-30T11:47:31Z)
Introduction to Core-sets: an Updated Survey [18.059360820527687]
In machine learning problems, the goal is to minimize or maximize an objective function over some space of candidate solutions. Traditional algorithms cannot handle modern systems that require parallel real-time computations of infinite distributed streams. This survey summarizes such constructions in a retrospective way, that aims to unified and simplify the state-of-the-art.
arXiv Detail & Related papers (2020-11-18T16:31:34Z)
Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems. Our results imply an improved algorithm for the Sample and Aggregate privacy framework. One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z)
Revisiting Modified Greedy Algorithm for Monotone Submodular Maximization with a Knapsack Constraint [75.85952446237599]
We show that a modified greedy algorithm can achieve an approximation factor of $0.305$. We derive a data-dependent upper bound on the optimum. It can also be used to significantly improve the efficiency of such algorithms as branch and bound.
arXiv Detail & Related papers (2020-08-12T15:40:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.