An Empirical Evaluation of $k$-Means Coresets
- URL: http://arxiv.org/abs/2207.00966v1
- Date: Sun, 3 Jul 2022 06:47:53 GMT
- Title: An Empirical Evaluation of $k$-Means Coresets
- Authors: Chris Schwiegelshohn and Omar Ali Sheikh-Omar
- Abstract summary: There is no work on comparing the quality of available $k$-means coresets.
We propose a benchmark for which we argue that computing coresets is challenging.
We conduct an exhaustive evaluation of the most commonly used coreset algorithms from theory and practice.
- Score: 4.45709593827781
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Coresets are among the most popular paradigms for summarizing data. In
particular, there exist many high performance coresets for clustering problems
such as $k$-means in both theory and practice. Curiously, there exists no work
on comparing the quality of available $k$-means coresets.
In this paper we perform such an evaluation. There is currently no known
algorithm for measuring the distortion of a candidate coreset. We provide some
evidence as to why this might be computationally difficult. To complement this,
we propose a benchmark for which we argue that computing coresets is
challenging and which also allows us an easy (heuristic) evaluation of
coresets. Using this benchmark and real-world data sets, we conduct an
exhaustive evaluation of the most commonly used coreset algorithms from theory
and practice.
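The heuristic evaluation mentioned above can be illustrated concretely. The sketch below is a minimal illustration, not the paper's actual benchmark protocol; the candidate-solution sampling strategy and all sizes are assumptions. It estimates the distortion of a weighted candidate coreset by comparing its $k$-means cost against the full data over a set of sampled solutions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_cost(points, centers, weights=None):
    """Sum of (weighted) squared distances from each point to its nearest center."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return d2 @ weights if weights is not None else d2.sum()

def heuristic_distortion(data, coreset, coreset_weights, k, n_solutions=50, seed=0):
    """Estimate coreset distortion as the worst cost ratio over sampled solutions."""
    rng = np.random.default_rng(seed)
    worst = 1.0
    for _ in range(n_solutions):
        # A random candidate solution: k centers drawn from the data.
        centers = data[rng.choice(len(data), size=k, replace=False)]
        full = kmeans_cost(data, centers)
        core = kmeans_cost(coreset, centers, coreset_weights)
        worst = max(worst, full / core, core / full)
    # Also check a locally optimal solution found by Lloyd's algorithm.
    centers = KMeans(n_clusters=k, n_init=5, random_state=seed).fit(data).cluster_centers_
    full = kmeans_cost(data, centers)
    core = kmeans_cost(coreset, centers, coreset_weights)
    return max(worst, full / core, core / full)
```

Because only finitely many solutions are checked, the returned value is only a lower bound on the true distortion, which is consistent with the observation above that measuring distortion exactly appears to be computationally difficult.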
Related papers
- Achieving More with Less: A Tensor-Optimization-Powered Ensemble Method [53.170053108447455]
Ensemble learning is a method that leverages weak learners to produce a strong learner.
We design a smooth and convex objective function that leverages the concept of margin, making the strong learner more discriminative.
We then compare our algorithm with random forests of ten times the size and other classical methods across numerous datasets.
arXiv Detail & Related papers (2024-08-06T03:42:38Z)
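A minimal sketch of margin-based reweighting of weak learners (a generic illustration only; the paper's tensor-optimization formulation is not reproduced here, and all function and parameter names are hypothetical): combine the weak learners' ±1 outputs with nonnegative weights and minimize a smooth, convex surrogate that rewards large margins.

```python
import numpy as np

def fit_ensemble_weights(preds, y, n_iters=500, lr=0.1):
    """preds: (n_samples, n_learners) matrix of +/-1 weak-learner outputs.
    y: (n_samples,) labels in {-1, +1}.
    Minimizes the logistic loss of the margin y * (preds @ w), a smooth,
    convex surrogate that pushes margins to be large and positive."""
    n, m = preds.shape
    w = np.full(m, 1.0 / m)
    for _ in range(n_iters):
        margins = y * (preds @ w)
        grad = -(preds * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w -= lr * grad
        w = np.clip(w, 0.0, None)   # keep weights nonnegative
        w /= max(w.sum(), 1e-12)    # normalize to a convex combination
    return w
```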
- Simple Weak Coresets for Non-Decomposable Classification Measures [3.5819148482955514]
We show that uniform sampling based coresets have excellent empirical performance, backed by theoretical guarantees.
We focus on the F1 score and the Matthews Correlation Coefficient, two widely used non-decomposable objective functions that are nontrivial to optimize for, and show that uniform coresets attain a lower bound on coreset size.
arXiv Detail & Related papers (2023-12-15T15:32:25Z)
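A minimal sketch of the uniform-sampling coreset paradigm evaluated on a non-decomposable measure (assumed details: binary labels, weights $n/m$, and an F1 score computed from weighted confusion counts; this is an illustration, not the paper's construction):

```python
import numpy as np

def uniform_coreset(X, y, m, seed=0):
    """Uniform-sampling coreset: m points drawn uniformly, each weighted n/m."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    weights = np.full(m, len(X) / m)
    return X[idx], y[idx], weights

def weighted_f1(y_true, y_pred, weights):
    """F1 = 2*TP / (2*TP + FP + FN), computed from weighted counts; labels in {0, 1}."""
    tp = weights[(y_true == 1) & (y_pred == 1)].sum()
    fp = weights[(y_true == 0) & (y_pred == 1)].sum()
    fn = weights[(y_true == 1) & (y_pred == 0)].sum()
    return 2 * tp / max(2 * tp + fp + fn, 1e-12)
```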
- Composable Core-sets for Diversity Approximation on Multi-Dataset Streams [4.765131728094872]
Composable core-sets are core-sets with the property that core-sets built for separate pieces of the data can be unioned together to obtain an approximation for the original data.
We introduce a core-set construction algorithm for constructing composable core-sets to summarize streamed data for use in active learning environments.
arXiv Detail & Related papers (2023-08-10T23:24:51Z)
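Composable core-sets for diversity-style objectives are often built by selecting spread-out representatives per stream block and unioning the selections. The sketch below uses greedy farthest-first traversal as the per-block selector, a standard choice assumed here rather than taken from the paper:

```python
import numpy as np

def farthest_first(X, k, seed=0):
    """Greedy farthest-first traversal: picks k spread-out representatives."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    dists = np.linalg.norm(X - X[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())          # point farthest from current picks
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen]

def composable_coreset(stream_blocks, k):
    """Union of per-block core-sets; composability means this union
    summarizes the concatenated stream."""
    return np.vstack([farthest_first(block, k) for block in stream_blocks])
```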
- AutoCoreset: An Automatic Practical Coreset Construction Framework [65.37876706107764]
A coreset is a tiny weighted subset of an input set that closely approximates the input's loss function.
We propose an automatic framework for constructing coresets, which requires only the input data and the desired cost function from the user.
We show that while the set of candidate solutions used is limited, the resulting coreset is quite general.
arXiv Detail & Related papers (2023-05-19T19:59:52Z)
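A generic recipe in this space is importance (sensitivity) sampling: sample points in proportion to their share of the cost on a few candidate solutions, and reweight by inverse probability. The sketch below illustrates that generic recipe only; it is not AutoCoreset's actual procedure, and `cost_fn` and all parameters are hypothetical.

```python
import numpy as np

def importance_coreset(X, cost_fn, candidate_solutions, m, seed=0):
    """cost_fn(x, sol) -> nonnegative per-point cost for one candidate solution.
    Points with a large share of the cost on some candidate solution are
    sampled more often; weights 1/(m*p) keep the cost estimate unbiased."""
    rng = np.random.default_rng(seed)
    # Sensitivity proxy: max over candidates of a point's normalized cost share.
    shares = np.zeros(len(X))
    for sol in candidate_solutions:
        costs = np.array([cost_fn(x, sol) for x in X])
        shares = np.maximum(shares, costs / max(costs.sum(), 1e-12))
    probs = shares / shares.sum()
    idx = rng.choice(len(X), size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])
    return X[idx], weights
```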
- Introduction to Coresets: Approximated Mean [29.520871474641485]
A strong coreset for the mean queries of a set $P$ in $\mathbb{R}^d$ is a small weighted subset $C \subseteq P$ that approximates, for every query point $x \in \mathbb{R}^d$, the sum of squared distances from $P$ to $x$.
A weak coreset is (also) a small weighted subset $C$ of $P$, whose mean approximates the mean of $P$.
arXiv Detail & Related papers (2021-11-04T17:49:38Z)
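The weak-coreset notion above admits a one-function illustration (a minimal sketch assuming uniform sampling): the weighted mean of a uniform sample concentrates around the mean of $P$.

```python
import numpy as np

def weak_mean_coreset(P, m, seed=0):
    """Uniform sample of m points, each weighted 1/m: its weighted mean
    approximates the mean of P (a weak coreset for the mean)."""
    rng = np.random.default_rng(seed)
    C = P[rng.choice(len(P), size=m, replace=False)]
    return C, np.full(m, 1.0 / m)

# Usage: the coreset mean should be close to the true mean.
P = np.random.default_rng(1).normal(size=(100_000, 5))
C, w = weak_mean_coreset(P, m=500)
print(np.linalg.norm(w @ C - P.mean(axis=0)))  # small
```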
- A Unified Approach to Coreset Learning [24.79658173754555]
A coreset of a given dataset and loss function is usually a small weighted set that approximates this loss for every query from a given set of queries.
We propose a generic, learning-based algorithm for construction of coresets.
arXiv Detail & Related papers (2021-11-04T17:48:05Z)
- FriendlyCore: Practical Differentially Private Aggregation [67.04951703461657]
We propose a simple and practical tool $\mathsf{FriendlyCore}$ that takes a set of points $\mathcal{D}$ from an unrestricted (pseudo) metric space as input.
When $\mathcal{D}$ has effective diameter $r$, $\mathsf{FriendlyCore}$ returns a "stable" subset $\mathcal{D}_G \subseteq \mathcal{D}$ that includes all points.
$\mathsf{FriendlyCore}$ can be used to preprocess the input before privately aggregating it, potentially simplifying the aggregation or boosting its accuracy.
arXiv Detail & Related papers (2021-10-19T17:43:50Z)
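A heavily simplified, non-private caricature of the "stable subset" idea (illustrative only; the real $\mathsf{FriendlyCore}$ applies randomized, privacy-preserving filtering): keep a point only if it lies within distance $r$ of most other points.

```python
import numpy as np

def stable_subset(D, r, frac=0.5):
    """Keep points within distance r of at least `frac` of the other points.
    A non-private caricature of friendliness-based filtering."""
    n = len(D)
    dists = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    near_counts = (dists <= r).sum(axis=1) - 1  # exclude the point itself
    return D[near_counts >= frac * (n - 1)]
```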
- Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making a few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z)
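For reference, the classical alternating updates behind the fuzzy (soft) $k$-means objective look as follows; the similarity-query mechanism studied in the paper is not modeled here.

```python
import numpy as np

def fuzzy_kmeans(X, k, m=2.0, n_iters=50, seed=0):
    """Alternating updates for the fuzzy c-means objective
    sum_ij u_ij^m ||x_i - c_j||^2 with memberships u_ij in [0, 1]."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Membership update: u_ij proportional to d_ij^(-2/(m-1)), normalized per point.
        u = d ** (-2.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)
        # Center update: weighted mean with weights u_ij^m.
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
    return centers, u
```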
- Ranking a set of objects: a graph based least-square approach [70.7866286425868]
We consider the problem of ranking $N$ objects starting from a set of noisy pairwise comparisons provided by a crowd of equal workers.
We propose a class of non-adaptive ranking algorithms that rely on a least-squares intrinsic optimization criterion for the estimation of qualities.
arXiv Detail & Related papers (2020-02-26T16:19:09Z)
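A minimal least-squares ranking sketch in the spirit described above (the paper's exact estimator is not reproduced; the input format is an assumption): estimate one quality score per object by solving a least-squares problem over the comparison graph's incidence matrix, with scores defined up to an additive constant.

```python
import numpy as np

def ls_rank(n, comparisons):
    """comparisons: list of (i, j, y) where y is the observed score
    difference (or win margin) of object i over object j.
    Solves min_s sum (s_i - s_j - y)^2 and centers the solution."""
    B = np.zeros((len(comparisons), n))
    y = np.zeros(len(comparisons))
    for row, (i, j, obs) in enumerate(comparisons):
        B[row, i], B[row, j], y[row] = 1.0, -1.0, obs
    s, *_ = np.linalg.lstsq(B, y, rcond=None)
    return s - s.mean()  # scores are defined up to an additive constant
```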
- On Coresets for Support Vector Machines [61.928187390362176]
A coreset is a small, representative subset of the original data points.
We show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings.
arXiv Detail & Related papers (2020-02-15T23:25:12Z)
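Streaming and distributed claims for coresets typically rest on the merge-and-reduce pattern; the sketch below shows that generic pattern, with the paper's SVM coreset construction abstracted behind a hypothetical `build_coreset` callback.

```python
import numpy as np

def merge_and_reduce(stream_blocks, build_coreset):
    """build_coreset(points, weights) -> (points, weights): any weighted
    coreset construction. Same-level coresets in the merge tree are unioned
    and re-reduced, so only O(log n) summaries are held in memory."""
    stack = []  # entries: (level, points, weights)
    for block in stream_blocks:
        pts, w = build_coreset(block, np.ones(len(block)))
        level = 0
        while stack and stack[-1][0] == level:
            _, prev_pts, prev_w = stack.pop()
            pts, w = build_coreset(np.vstack([prev_pts, pts]),
                                   np.concatenate([prev_w, w]))
            level += 1
        stack.append((level, pts, w))
    # The final summary is the union of the surviving per-level coresets.
    return (np.vstack([p for _, p, _ in stack]),
            np.concatenate([w for _, _, w in stack]))
```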