An Empirical Evaluation of $k$-Means Coresets
- URL: http://arxiv.org/abs/2207.00966v1
- Date: Sun, 3 Jul 2022 06:47:53 GMT
- Title: An Empirical Evaluation of $k$-Means Coresets
- Authors: Chris Schwiegelshohn and Omar Ali Sheikh-Omar
- Abstract summary: There is no work on comparing the quality of available $k$-means coresets.
We propose a benchmark for which we argue that computing coresets is challenging.
We conduct an exhaustive evaluation of the most commonly used coreset algorithms from theory and practice.
- Score: 4.45709593827781
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Coresets are among the most popular paradigms for summarizing data. In
particular, there exist many high performance coresets for clustering problems
such as $k$-means in both theory and practice. Curiously, there exists no work
on comparing the quality of available $k$-means coresets.
In this paper we perform such an evaluation. There is currently no known
algorithm for measuring the distortion of a candidate coreset. We provide some
evidence as to why this might be computationally difficult. To complement this,
we propose a benchmark for which we argue that computing coresets is
challenging and which also allows us an easy (heuristic) evaluation of
coresets. Using this benchmark and real-world data sets, we conduct an
exhaustive evaluation of the most commonly used coreset algorithms from theory
and practice.
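The heuristic evaluation mentioned above can be illustrated concretely. The sketch below is a minimal illustration, not the paper's actual benchmark protocol; the candidate-solution sampling strategy and all sizes are assumptions. It estimates the distortion of a weighted candidate coreset by comparing its $k$-means cost against the full data over a set of sampled solutions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_cost(points, centers, weights=None):
    """Sum of (weighted) squared distances from each point to its nearest center."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return d2 @ weights if weights is not None else d2.sum()

def heuristic_distortion(data, coreset, coreset_weights, k, n_solutions=50, seed=0):
    """Estimate coreset distortion as the worst cost ratio over sampled solutions."""
    rng = np.random.default_rng(seed)
    worst = 1.0
    for _ in range(n_solutions):
        # A random candidate solution: k centers drawn from the data.
        centers = data[rng.choice(len(data), size=k, replace=False)]
        full = kmeans_cost(data, centers)
        core = kmeans_cost(coreset, centers, coreset_weights)
        worst = max(worst, full / core, core / full)
    # Also check a locally optimal solution found by Lloyd's algorithm.
    centers = KMeans(n_clusters=k, n_init=5, random_state=seed).fit(data).cluster_centers_
    full = kmeans_cost(data, centers)
    core = kmeans_cost(coreset, centers, coreset_weights)
    return max(worst, full / core, core / full)
```

Because only finitely many solutions are checked, the returned value is only a lower bound on the true distortion, which is consistent with the observation above that measuring distortion exactly appears to be computationally difficult.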
Related papers
- Achieving More with Less: A Tensor-Optimization-Powered Ensemble Method [53.170053108447455]
Ensemble learning is a method that leverages weak learners to produce a strong learner.
We design a smooth and convex objective function that leverages the concept of margin, making the strong learner more discriminative.
We then compare our algorithm with random forests of ten times the size and other classical methods across numerous datasets.
arXiv Detail & Related papers (2024-08-06T03:42:38Z)
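A minimal sketch of margin-based reweighting of weak learners (a generic illustration only; the paper's tensor-optimization formulation is not reproduced here, and all function and parameter names are hypothetical): combine the weak learners' ±1 outputs with nonnegative weights and minimize a smooth, convex surrogate that rewards large margins.

```python
import numpy as np

def fit_ensemble_weights(preds, y, n_iters=500, lr=0.1):
    """preds: (n_samples, n_learners) matrix of +/-1 weak-learner outputs.
    y: (n_samples,) labels in {-1, +1}.
    Minimizes the logistic loss of the margin y * (preds @ w), a smooth,
    convex surrogate that pushes margins to be large and positive."""
    n, m = preds.shape
    w = np.full(m, 1.0 / m)
    for _ in range(n_iters):
        margins = y * (preds @ w)
        grad = -(preds * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w -= lr * grad
        w = np.clip(w, 0.0, None)   # keep weights nonnegative
        w /= max(w.sum(), 1e-12)    # normalize to a convex combination
    return w
```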
- Simple Weak Coresets for Non-Decomposable Classification Measures [3.5819148482955514]
We show that uniform sampling based coresets have excellent empirical performance, backed by theoretical guarantees.
We focus on the F1 score and the Matthews Correlation Coefficient, two widely used non-decomposable objective functions that are nontrivial to optimize for, and show that uniform coresets attain a lower bound on coreset size.
arXiv Detail & Related papers (2023-12-15T15:32:25Z)
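A minimal sketch of the uniform-sampling coreset paradigm evaluated on a non-decomposable measure (assumed details: binary labels, weights $n/m$, and an F1 score computed from weighted confusion counts; this is an illustration, not the paper's construction):

```python
import numpy as np

def uniform_coreset(X, y, m, seed=0):
    """Uniform-sampling coreset: m points drawn uniformly, each weighted n/m."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    weights = np.full(m, len(X) / m)
    return X[idx], y[idx], weights

def weighted_f1(y_true, y_pred, weights):
    """F1 = 2*TP / (2*TP + FP + FN), computed from weighted counts; labels in {0, 1}."""
    tp = weights[(y_true == 1) & (y_pred == 1)].sum()
    fp = weights[(y_true == 0) & (y_pred == 1)].sum()
    fn = weights[(y_true == 1) & (y_pred == 0)].sum()
    return 2 * tp / max(2 * tp + fp + fn, 1e-12)
```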
- Composable Core-sets for Diversity Approximation on Multi-Dataset Streams [4.765131728094872]
Composable core-sets are core-sets with the property that core-sets built for separate pieces of the data can be unioned together to obtain an approximation for the original data.
We introduce a core-set construction algorithm for constructing composable core-sets to summarize streamed data for use in active learning environments.
arXiv Detail & Related papers (2023-08-10T23:24:51Z)
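Composable core-sets for diversity-style objectives are often built by selecting spread-out representatives per stream block and unioning the selections. The sketch below uses greedy farthest-first traversal as the per-block selector, a standard choice assumed here rather than taken from the paper:

```python
import numpy as np

def farthest_first(X, k, seed=0):
    """Greedy farthest-first traversal: picks k spread-out representatives."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    dists = np.linalg.norm(X - X[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())          # point farthest from current picks
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen]

def composable_coreset(stream_blocks, k):
    """Union of per-block core-sets; composability means this union
    summarizes the concatenated stream."""
    return np.vstack([farthest_first(block, k) for block in stream_blocks])
```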
- AutoCoreset: An Automatic Practical Coreset Construction Framework [65.37876706107764]
A coreset is a tiny weighted subset of an input set that closely approximates the input's loss function.
We propose an automatic framework for constructing coresets, which requires only the input data and the desired cost function from the user.
We show that while the set of candidate solutions used is limited, the resulting coreset is quite general.
arXiv Detail & Related papers (2023-05-19T19:59:52Z)
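A generic recipe in this space is importance (sensitivity) sampling: sample points in proportion to their share of the cost on a few candidate solutions, and reweight by inverse probability. The sketch below illustrates that generic recipe only; it is not AutoCoreset's actual procedure, and `cost_fn` and all parameters are hypothetical.

```python
import numpy as np

def importance_coreset(X, cost_fn, candidate_solutions, m, seed=0):
    """cost_fn(x, sol) -> nonnegative per-point cost for one candidate solution.
    Points with a large share of the cost on some candidate solution are
    sampled more often; weights 1/(m*p) keep the cost estimate unbiased."""
    rng = np.random.default_rng(seed)
    # Sensitivity proxy: max over candidates of a point's normalized cost share.
    shares = np.zeros(len(X))
    for sol in candidate_solutions:
        costs = np.array([cost_fn(x, sol) for x in X])
        shares = np.maximum(shares, costs / max(costs.sum(), 1e-12))
    probs = shares / shares.sum()
    idx = rng.choice(len(X), size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])
    return X[idx], weights
```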
- Introduction to Coresets: Approximated Mean [29.520871474641485]
A strong coreset for the mean queries of a set $P$ in $\mathbb{R}^d$ is a small weighted subset $C \subseteq P$ that approximates, for every query point $x \in \mathbb{R}^d$, the sum of squared distances from $P$ to $x$.
A weak coreset is (also) a small weighted subset $C$ of $P$, whose mean approximates the mean of $P$.
arXiv Detail & Related papers (2021-11-04T17:49:38Z)
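The weak-coreset notion above admits a one-function illustration (a minimal sketch assuming uniform sampling): the weighted mean of a uniform sample concentrates around the mean of $P$.

```python
import numpy as np

def weak_mean_coreset(P, m, seed=0):
    """Uniform sample of m points, each weighted 1/m: its weighted mean
    approximates the mean of P (a weak coreset for the mean)."""
    rng = np.random.default_rng(seed)
    C = P[rng.choice(len(P), size=m, replace=False)]
    return C, np.full(m, 1.0 / m)

# Usage: the coreset mean should be close to the true mean.
P = np.random.default_rng(1).normal(size=(100_000, 5))
C, w = weak_mean_coreset(P, m=500)
print(np.linalg.norm(w @ C - P.mean(axis=0)))  # small
```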
- A Unified Approach to Coreset Learning [24.79658173754555]
A coreset of a given dataset and loss function is usually a small weighted set that approximates this loss for every query from a given set of queries.
We propose a generic, learning-based algorithm for construction of coresets.
arXiv Detail & Related papers (2021-11-04T17:48:05Z)
- FriendlyCore: Practical Differentially Private Aggregation [67.04951703461657]
We propose a simple and practical tool $\mathsf{FriendlyCore}$ that takes a set of points $\mathcal{D}$ from an unrestricted (pseudo) metric space as input.
When $\mathcal{D}$ has effective diameter $r$, $\mathsf{FriendlyCore}$ returns a "stable" subset $\mathcal{D}_G \subseteq \mathcal{D}$ that includes all points.
$\mathsf{FriendlyCore}$ can be used to preprocess the input before privately aggregating it, potentially simplifying the aggregation or boosting its accuracy.
arXiv Detail & Related papers (2021-10-19T17:43:50Z)
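A heavily simplified, non-private caricature of the "stable subset" idea (illustrative only; the real $\mathsf{FriendlyCore}$ applies randomized, privacy-preserving filtering): keep a point only if it lies within distance $r$ of most other points.

```python
import numpy as np

def stable_subset(D, r, frac=0.5):
    """Keep points within distance r of at least `frac` of the other points.
    A non-private caricature of friendliness-based filtering."""
    n = len(D)
    dists = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    near_counts = (dists <= r).sum(axis=1) - 1  # exclude the point itself
    return D[near_counts >= frac * (n - 1)]
```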
- Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making a few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z)
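For reference, the classical alternating updates behind the fuzzy (soft) $k$-means objective look as follows; the similarity-query mechanism studied in the paper is not modeled here.

```python
import numpy as np

def fuzzy_kmeans(X, k, m=2.0, n_iters=50, seed=0):
    """Alternating updates for the fuzzy c-means objective
    sum_ij u_ij^m ||x_i - c_j||^2 with memberships u_ij in [0, 1]."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Membership update: u_ij proportional to d_ij^(-2/(m-1)), normalized per point.
        u = d ** (-2.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)
        # Center update: weighted mean with weights u_ij^m.
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
    return centers, u
```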
- Ranking a set of objects: a graph based least-square approach [70.7866286425868]
We consider the problem of ranking $N$ objects starting from a set of noisy pairwise comparisons provided by a crowd of equal workers.
We propose a class of non-adaptive ranking algorithms that rely on a least-squares intrinsic optimization criterion for the estimation of qualities.
arXiv Detail & Related papers (2020-02-26T16:19:09Z)
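A minimal least-squares ranking sketch in the spirit described above (the paper's exact estimator is not reproduced; the input format is an assumption): estimate one quality score per object by solving a least-squares problem over the comparison graph's incidence matrix, with scores defined up to an additive constant.

```python
import numpy as np

def ls_rank(n, comparisons):
    """comparisons: list of (i, j, y) where y is the observed score
    difference (or win margin) of object i over object j.
    Solves min_s sum (s_i - s_j - y)^2 and centers the solution."""
    B = np.zeros((len(comparisons), n))
    y = np.zeros(len(comparisons))
    for row, (i, j, obs) in enumerate(comparisons):
        B[row, i], B[row, j], y[row] = 1.0, -1.0, obs
    s, *_ = np.linalg.lstsq(B, y, rcond=None)
    return s - s.mean()  # scores are defined up to an additive constant
```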
- On Coresets for Support Vector Machines [61.928187390362176]
A coreset is a small, representative subset of the original data points.
We show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings.
arXiv Detail & Related papers (2020-02-15T23:25:12Z)
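Streaming and distributed claims for coresets typically rest on the merge-and-reduce pattern; the sketch below shows that generic pattern, with the paper's SVM coreset construction abstracted behind a hypothetical `build_coreset` callback.

```python
import numpy as np

def merge_and_reduce(stream_blocks, build_coreset):
    """build_coreset(points, weights) -> (points, weights): any weighted
    coreset construction. Same-level coresets in the merge tree are unioned
    and re-reduced, so only O(log n) summaries are held in memory."""
    stack = []  # entries: (level, points, weights)
    for block in stream_blocks:
        pts, w = build_coreset(block, np.ones(len(block)))
        level = 0
        while stack and stack[-1][0] == level:
            _, prev_pts, prev_w = stack.pop()
            pts, w = build_coreset(np.vstack([prev_pts, pts]),
                                   np.concatenate([prev_w, w]))
            level += 1
        stack.append((level, pts, w))
    # The final summary is the union of the surviving per-level coresets.
    return (np.vstack([p for _, p, _ in stack]),
            np.concatenate([w for _, _, w in stack]))
```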