A Weighted K-Center Algorithm for Data Subset Selection
- URL: http://arxiv.org/abs/2312.10602v1
- Date: Sun, 17 Dec 2023 04:41:07 GMT
- Title: A Weighted K-Center Algorithm for Data Subset Selection
- Authors: Srikumar Ramalingam, Pranjal Awasthi, Sanjiv Kumar
- Abstract summary: Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
- Score: 70.49696246526199
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The success of deep learning hinges on enormous data and large models, which
require labor-intensive annotations and heavy computation costs. Subset
selection is a fundamental problem that can play a key role in identifying
smaller portions of the training data, which can then be used to produce
models similar to those trained on the full data. Two prior methods are shown
to achieve impressive results: (1) margin sampling, which focuses on selecting
points with high uncertainty, and (2) core-sets or clustering methods such as
k-center for informative and diverse subsets. We are not aware of any work that
combines these methods in a principled manner. To this end, we develop a novel
and efficient factor 3-approximation algorithm to compute subsets based on the
weighted sum of both k-center and uncertainty sampling objective functions. To
handle large datasets, we present a parallel algorithm that runs on multiple machines
with approximation guarantees. The proposed algorithm achieves similar or
better performance compared to other strong baselines on vision datasets such
as CIFAR-10, CIFAR-100, and ImageNet.
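To make the weighted objective concrete, here is a minimal sketch (Python/NumPy) of a greedy selection rule that scores each candidate by a weighted sum of its k-center coverage distance and its margin-based uncertainty. The trade-off weight `lam`, the scoring functions, and the greedy strategy are illustrative assumptions for intuition only; this is not the paper's factor 3-approximation algorithm or its parallel variant.

```python
import numpy as np

def margin_uncertainty(probs):
    """Margin uncertainty: 1 - (top-1 prob - top-2 prob). Higher = more uncertain."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return 1.0 - (top2[:, 1] - top2[:, 0])

def weighted_kcenter_greedy(features, probs, budget, lam=0.5):
    """Greedy subset selection (assumed heuristic, not the paper's algorithm).

    features: (n, d) embeddings; probs: (n, c) softmax outputs;
    budget: number of points to select; lam: coverage/uncertainty trade-off.
    """
    uncertainty = margin_uncertainty(probs)
    # Seed with the most uncertain point.
    selected = [int(np.argmax(uncertainty))]
    # Distance from every point to its nearest selected center (k-center coverage term).
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)
    while len(selected) < budget:
        # Weighted sum of coverage distance and uncertainty.
        score = lam * min_dist + (1.0 - lam) * uncertainty
        score[selected] = -np.inf  # never re-select a chosen point
        nxt = int(np.argmax(score))
        selected.append(nxt)
        # Update nearest-center distances with the newly added center.
        min_dist = np.minimum(min_dist, np.linalg.norm(features - features[nxt], axis=1))
    return selected

# Example usage on synthetic data (shapes are assumptions):
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(1000, 64))
    logits = rng.normal(size=(1000, 10))
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    subset = weighted_kcenter_greedy(feats, probs, budget=50, lam=0.5)
    print(len(subset), subset[:10])
```

Setting `lam=1.0` in this sketch reduces it to the standard greedy k-center (farthest-point) heuristic, while `lam=0.0` reduces it to pure margin sampling, which is the intuition behind combining the two objectives.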
Related papers
- TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data [29.45013725650798]
It is essential to extract a subset of the instruction data that achieves performance comparable to the full dataset.
We propose Task-Agnostic Gradient Clustered COreset Selection (TAGCOS).
Specifically, we leverage sample gradients as the data representations, perform clustering to group similar data, and apply an efficient greedy algorithm for coreset selection.
arXiv Detail & Related papers (2024-07-21T17:59:20Z)
- Composable Core-sets for Diversity Approximation on Multi-Dataset Streams [4.765131728094872]
Composable core-sets are core-sets with the property that subsets of the core set can be unioned together to obtain an approximation for the original data.
We introduce a core-set construction algorithm for constructing composable core-sets to summarize streamed data for use in active learning environments.
arXiv Detail & Related papers (2023-08-10T23:24:51Z)
- Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data [100.33096338195723]
We focus on Few-shot Learning with Auxiliary Data (FLAD)
FLAD assumes access to auxiliary data during few-shot learning in hopes of improving generalization.
We propose two algorithms -- EXP3-FLAD and UCB1-FLAD -- and compare them with prior FLAD methods that either explore or exploit.
arXiv Detail & Related papers (2023-02-01T18:59:36Z)
- Scalable Batch Acquisition for Deep Bayesian Active Learning [70.68403899432198]
In deep active learning, it is important to choose multiple examples to annotate at each step.
Existing solutions to this problem, such as BatchBALD, have significant limitations in selecting a large number of examples.
We present the Large BatchBALD algorithm, which aims to achieve comparable quality while being more computationally efficient.
arXiv Detail & Related papers (2023-01-13T11:45:17Z)
- SSDBCODI: Semi-Supervised Density-Based Clustering with Outliers Detection Integrated [1.8444322599555096]
Clustering analysis is one of the critical tasks in machine learning.
Because clustering performance can be significantly degraded by outliers, many algorithms incorporate an outlier-detection step.
We propose SSDBCODI, a semi-supervised density-based clustering method with integrated outlier detection.
arXiv Detail & Related papers (2022-08-10T21:06:38Z)
- A sampling-based approach for efficient clustering in large datasets [0.8952229340927184]
We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters.
Our method is substantially more efficient than k-means, as it does not require an all-to-all comparison of data points and clusters.
arXiv Detail & Related papers (2021-12-29T19:15:20Z)
- Robust Trimmed k-means [70.88503833248159]
We propose Robust Trimmed k-means (RTKM) that simultaneously identifies outliers and clusters points.
We show RTKM performs competitively with other methods on single membership data with outliers and multi-membership data without outliers.
arXiv Detail & Related papers (2021-08-16T15:49:40Z)
- DAC: Deep Autoencoder-based Clustering, a General Deep Learning Framework of Representation Learning [0.0]
We propose DAC, Deep Autoencoder-based Clustering, a data-driven framework to learn clustering representations using deep neural networks.
Experiment results show that our approach can effectively boost the performance of the k-means clustering algorithm on a variety of datasets.
arXiv Detail & Related papers (2021-02-15T11:31:00Z)
- Determinantal consensus clustering [77.34726150561087]
We propose the use of determinantal point processes or DPP for the random restart of clustering algorithms.
DPPs favor diversity of the center points within subsets.
We show through simulations that, contrary to DPP, standard uniform random initialization fails both to ensure diversity and to obtain good coverage of all data facets.
arXiv Detail & Related papers (2021-02-07T23:48:24Z)
- FedPD: A Federated Learning Framework with Optimal Rates and Adaptivity to Non-IID Data [59.50904660420082]
Federated Learning (FL) has become a popular paradigm for learning from distributed data.
To effectively utilize data at different devices without moving them to the cloud, algorithms such as the Federated Averaging (FedAvg) have adopted a "computation then aggregation" (CTA) model.
arXiv Detail & Related papers (2020-05-22T23:07:42Z)