Composable Core-sets for Diversity Approximation on Multi-Dataset Streams
- URL: http://arxiv.org/abs/2308.05878v1
- Date: Thu, 10 Aug 2023 23:24:51 GMT
- Title: Composable Core-sets for Diversity Approximation on Multi-Dataset Streams
- Authors: Stephanie Wang, Michael Flynn, and Fangyu Luo
- Abstract summary: Composable core-sets are core-sets with the property that subsets of the core-set can be unioned together to obtain an approximation of the original data.
We introduce a core-set construction algorithm for constructing composable core-sets to summarize streamed data for use in active learning environments.
- Score: 4.765131728094872
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Core-sets are subsets of data that maximize some function, commonly
a diversity or group requirement. These subsets are used in place of the
original data to accomplish a given task with comparable or even enhanced
performance, provided biases are removed. Composable core-sets are core-sets
with the property that subsets of the core-set can be unioned together to
obtain an approximation of the original data, which makes them well suited to
streamed or distributed data. Recent work has focused on the use of core-sets
for training machine learning models. Preceding solutions such as CRAIG have
been proven to approximate gradient descent while reducing training time. In
this paper, we introduce an algorithm for constructing composable core-sets to
summarize streamed data for use in active-learning environments. Combined with
techniques such as CRAIG and heuristics to speed up construction, composable
core-sets could enable real-time training of models when the volume of
incoming sensor data is large. We provide an empirical analysis of the runtime
of a brute-force construction algorithm using extrapolated data; the
algorithm's efficiency is then analyzed through averaged empirical regression,
and key results and improvements are suggested for further research on the
topic.
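To make the abstract's core idea concrete, the sketch below builds a diversity core-set for one stream chunk with a greedy farthest-first heuristic and then composes per-chunk core-sets by unioning them and re-running the same construction. This is a generic max-min-diversity sketch under assumed names (`diversity_coreset`, `compose`, a user-supplied `dist`), not the paper's actual algorithm.

```python
def diversity_coreset(points, k, dist):
    """Greedy farthest-first selection: repeatedly add the point
    farthest from the current core-set, a standard heuristic for
    max-min diversity objectives."""
    pts = list(points)
    core = [pts[0]]
    while len(core) < min(k, len(pts)):
        # pick the point whose nearest core-set member is farthest away
        best = max((p for p in pts if p not in core),
                   key=lambda p: min(dist(p, c) for c in core))
        core.append(best)
    return core

def compose(coresets, k, dist):
    """Composability: union the per-chunk core-sets and reduce the
    union with the same construction to summarize the full stream."""
    union = [p for cs in coresets for p in cs]
    return diversity_coreset(union, k, dist)
```

For example, with a one-dimensional distance `dist = lambda a, b: abs(a - b)`, core-sets built on separate stream chunks can be merged without revisiting the raw data.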
Related papers
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z) - A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
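The weighted-sum objective described above can be illustrated with a simple greedy rule that trades k-center coverage against uncertainty sampling. This is an illustrative hybrid, not the paper's factor-3 algorithm; `uncertainty` and `alpha` are assumed names.

```python
def weighted_subset(points, k, dist, uncertainty, alpha=0.5):
    """Greedy selection on a weighted sum of a k-center term
    (distance to the nearest already-selected point) and an
    uncertainty-sampling term. Illustrative sketch only."""
    # seed with the most uncertain point
    selected = [max(points, key=uncertainty)]
    while len(selected) < min(k, len(points)):
        best = max((p for p in points if p not in selected),
                   key=lambda p: alpha * min(dist(p, s) for s in selected)
                              + (1 - alpha) * uncertainty(p))
        selected.append(best)
    return selected
```

With `alpha=1.0` this reduces to plain farthest-first k-center selection; with `alpha=0.0` it reduces to pure uncertainty sampling.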
arXiv Detail & Related papers (2023-12-17T04:41:07Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - Adaptive Second Order Coresets for Data-efficient Machine Learning [5.362258158646462]
Training machine learning models on datasets incurs substantial computational costs.
We propose AdaCore to extract subsets of the training examples for efficient machine learning.
arXiv Detail & Related papers (2022-07-28T05:43:09Z) - Random projections and Kernelised Leave One Cluster Out Cross-Validation: Universal baselines and evaluation tools for supervised machine learning for materials properties [10.962094053749093]
Leave one cluster out cross-validation (LOCO-CV) has been introduced as a way of measuring the performance of an algorithm in predicting previously unseen groups of materials.
We present a thorough comparison between composition-based representations, and investigate how kernel approximation functions can be used to enhance LOCO-CV applications.
We find that domain knowledge does not improve machine learning performance in most tasks tested, with band gap prediction being the notable exception.
arXiv Detail & Related papers (2022-06-17T15:39:39Z) - Robust Coreset for Continuous-and-Bounded Learning (with Outliers) [30.91741925182613]
We propose a novel robust coreset method for the continuous-and-bounded learning problem (with outliers).
Our robust coreset can be efficiently maintained in a fully-dynamic environment.
arXiv Detail & Related papers (2021-06-30T19:24:20Z) - Coresets via Bilevel Optimization for Continual Learning and Streaming [86.67190358712064]
We propose a novel coreset construction via cardinality-constrained bilevel optimization.
We show how our framework can efficiently generate coresets for deep neural networks, and demonstrate its empirical benefits in continual learning and in streaming settings.
arXiv Detail & Related papers (2020-06-06T14:20:25Z) - New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm, called RIn-Close_CVC3, keeps the attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z) - Uncovering Coresets for Classification With Multi-Objective Evolutionary Algorithms [0.8057006406834467]
A coreset is a subset of the training set, using which a machine learning algorithm obtains performances similar to what it would deliver if trained over the whole original data.
A novel approach is presented: candidate coresets are iteratively optimized by adding and removing samples.
A multi-objective evolutionary algorithm is used to minimize simultaneously the number of points in the set and the classification error.
arXiv Detail & Related papers (2020-02-20T09:59:56Z) - On Coresets for Support Vector Machines [61.928187390362176]
A coreset is a small, representative subset of the original data points.
We show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings.
arXiv Detail & Related papers (2020-02-15T23:25:12Z)
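Several entries above (composable core-sets, streaming SVM coresets) rely on the same composition pattern for streams. A minimal merge-and-reduce sketch, assuming a user-supplied `build_coreset(points, k)` reducer, could look like:

```python
def merge_and_reduce(stream_chunks, build_coreset, k):
    """Merge-and-reduce: keep one core-set per 'level'; whenever two
    core-sets of the same level exist, union them and re-reduce.
    A standard way to turn composable core-sets into a streaming
    algorithm; not specific to any one of the papers listed."""
    levels = {}  # level -> core-set at that level
    for chunk in stream_chunks:
        cs = build_coreset(chunk, k)
        level = 0
        while level in levels:
            # two core-sets at this level: merge and promote
            cs = build_coreset(levels.pop(level) + cs, k)
            level += 1
        levels[level] = cs
    # final summary: reduce the union of all remaining levels
    union = [p for cs in levels.values() for p in cs]
    return build_coreset(union, k)
```

Because only O(log n) core-sets are kept at once, memory stays small even when the stream itself does not fit in memory.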
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.