Label-consistent clustering for evolving data
- URL: http://arxiv.org/abs/2512.15210v1
- Date: Wed, 17 Dec 2025 09:03:59 GMT
- Title: Label-consistent clustering for evolving data
- Authors: Ameet Gadekar, Aristides Gionis, Thibault Marette,
- Abstract summary: We study the above problem in the context of clustering, specifically focusing on the $k$-center problem.<n>We propose two constant-factor approximation algorithms for it.<n>We complement our theoretical findings with an experimental evaluation demonstrating the effectiveness of our methods on real-world datasets.
- Score: 14.91460660469822
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data analysis often involves an iterative process, where solutions must be continuously refined in response to new data. Typically, as new data becomes available, an existing solution must be updated to incorporate the latest information. In addition to seeking a high-quality solution for the task at hand, it is also crucial to ensure consistency by minimizing drastic changes from previous solutions. Applying this approach across many iterations, ensures that the solution evolves gradually and smoothly. In this paper, we study the above problem in the context of clustering, specifically focusing on the $k$-center problem. More precisely, we study the following problem: Given a set of points $X$, parameters $k$ and $b$, and a prior clustering solution $H$ for $X$, our goal is to compute a new solution $C$ for $X$, consisting of $k$ centers, which minimizes the clustering cost while introducing at most $b$ changes from $H$. We refer to this problem as label-consistent $k$-center, and we propose two constant-factor approximation algorithms for it. We complement our theoretical findings with an experimental evaluation demonstrating the effectiveness of our methods on real-world datasets.
Related papers
- Data Selection for ERMs [67.57726352698933]
We study how well can $mathcalA$ perform when trained on at most $nll N$ data points selected from a population of $N$ points.<n>Our results include optimal data-selection bounds for mean estimation, linear classification, and linear regression.
arXiv Detail & Related papers (2025-04-20T11:26:01Z) - Dynamic Consistent $k$-Center Clustering with Optimal Recourse [0.6077284832583713]
We prove that the $k$-center clustering problem admits optimal recourse bounds by developing a deterministic constant-factor approximation with worst-case recourse of $1$ per update.<n>Our incremental algorithm improves over the $8$-approximation algorithm by Charikar, Chekuri, Feder, and Motwani.
arXiv Detail & Related papers (2024-12-04T11:39:03Z) - Relax and Merge: A Simple Yet Effective Framework for Solving Fair $k$-Means and $k$-sparse Wasserstein Barycenter Problems [8.74967598360817]
Given a dataset comprising several groups, the fairness constraint requires that each cluster should contain a proportion of points from each group.<n>We propose a novel Relax and Merge'' framework, where $rho$ is the approximate ratio of an off-the-shelf vanilla $k$-means algorithm.<n>If equipped with a PTAS of $k$-means, our solution can achieve an approximation ratio of $(5+O(epsilon))$ with only a slight violation of the fairness constraints.
arXiv Detail & Related papers (2024-11-02T02:50:12Z) - Fair Clustering for Data Summarization: Improved Approximation Algorithms and Complexity Insights [16.120911591795295]
In some applications all data points can be chosen as centers, in the general setting, centers must be chosen from a subset of points, referred as facilities or suppliers.
In this work, we focus on fair data summarization modeled as the fair $k$-supplier problem, where data consists of several groups, and a minimum number of centers must be selected from each group.
arXiv Detail & Related papers (2024-10-16T18:00:19Z) - Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization [72.69498649272347]
conditional distributions is a central problem in machine learning.<n>We propose a new paradigm that integrates both paired and unpaired data.<n>We show that our approach can theoretically recover true conditional distributions with arbitrarily small error.
arXiv Detail & Related papers (2024-10-03T16:12:59Z) - On the Necessity of Collaboration for Online Model Selection with Decentralized Data [53.244188985271606]
We consider online model selection with decentralized data over $M$ clients, and study the necessity of collaboration among clients.
Our results show (i) collaboration is unnecessary in the absence of computational constraints on clients; (ii) collaboration is necessary if the computational cost on each client is limited to $o(K)$, where $K$ is the number of candidate hypothesis spaces.
arXiv Detail & Related papers (2024-04-15T06:32:28Z) - Improved Learning-augmented Algorithms for k-means and k-medians
Clustering [8.04779839951237]
We consider the problem of clustering in the learning-augmented setting, where we are given a data set in $d$-dimensional Euclidean space.
We propose a deterministic $k$-means algorithm that produces centers with improved bound on clustering cost.
Our algorithm works even when the predictions are not very accurate, i.e. our bound holds for $alpha$ up to $1/2$, an improvement over $alpha$ being at most $1/7$ in the previous work.
arXiv Detail & Related papers (2022-10-31T03:00:11Z) - Local Stochastic Bilevel Optimization with Momentum-Based Variance
Reduction [104.41634756395545]
We study Federated Bilevel Optimization problems. Specifically, we first propose the FedBiO, a deterministic gradient-based algorithm.
We show FedBiO has complexity of $O(epsilon-1.5)$.
Our algorithms show superior performances compared to other baselines in numerical experiments.
arXiv Detail & Related papers (2022-05-03T16:40:22Z) - Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points.
We provide implementable differentially private clustering algorithms that provide utility when the data is "easy"
We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z) - A New Notion of Individually Fair Clustering: $\alpha$-Equitable
$k$-Center [31.936733991407134]
We introduce a novel definition of fairness for clustering problems.
In our model each point $j$ has a set of other points $mathcalS_j$ that it perceives as similar to itself.
We provide efficient and easily implementable approximation algorithms for the $k$-center objective.
arXiv Detail & Related papers (2021-06-09T22:52:00Z) - How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z) - A Momentum-Assisted Single-Timescale Stochastic Approximation Algorithm
for Bilevel Optimization [112.59170319105971]
We propose a new algorithm -- the Momentum- Single-timescale Approximation (MSTSA) -- for tackling problems.
MSTSA allows us to control the error in iterations due to inaccurate solution to the lower level subproblem.
arXiv Detail & Related papers (2021-02-15T07:10:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.