A Coreset Learning Reality Check
- URL: http://arxiv.org/abs/2301.06163v1
- Date: Sun, 15 Jan 2023 19:26:17 GMT
- Title: A Coreset Learning Reality Check
- Authors: Fred Lu, Edward Raff, James Holt
- Abstract summary: Subsampling algorithms are a natural approach to reduce data size before fitting models on massive datasets.
In recent years, several works have proposed methods for subsampling rows from a data matrix while maintaining relevant information for classification.
We compare multiple methods for logistic regression drawn from the coreset and optimal subsampling literature and discover inconsistencies in their effectiveness.
- Score: 33.002265576337486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Subsampling algorithms are a natural approach to reduce data size before
fitting models on massive datasets. In recent years, several works have
proposed methods for subsampling rows from a data matrix while maintaining
relevant information for classification. While these works are supported by
theory and limited experiments, to date there has not been a comprehensive
evaluation of these methods. In our work, we directly compare multiple methods
for logistic regression drawn from the coreset and optimal subsampling
literature and discover inconsistencies in their effectiveness. In many cases,
methods do not outperform simple uniform subsampling.
Related papers
- A Closer Look at Deep Learning Methods on Tabular Datasets [52.50778536274327]
Tabular data is prevalent across diverse domains in machine learning.
Deep Neural Network (DNN)-based methods have recently demonstrated promising performance.
We compare 32 state-of-the-art deep and tree-based methods, evaluating their average performance across multiple criteria.
arXiv Detail & Related papers (2024-07-01T04:24:07Z) - Unsupervised Estimation of Ensemble Accuracy [0.0]
We present a method for estimating the joint power of several classifiers.
It differs from existing approaches which focus on "diversity" measures by not relying on labels.
We demonstrate the method on popular large-scale face recognition datasets.
arXiv Detail & Related papers (2023-11-18T02:31:36Z) - DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [72.25697820290502]
This work introduces a straightforward and efficient strategy to identify potential novel classes through zero-shot classification.
We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, and re-training.
Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance.
arXiv Detail & Related papers (2023-10-02T17:52:24Z) - Convolutional autoencoder-based multimodal one-class classification [80.52334952912808]
One-class classification refers to approaches of learning using data from a single class only.
We propose a deep learning one-class classification method suitable for multimodal data.
arXiv Detail & Related papers (2023-09-25T12:31:18Z) - Composable Core-sets for Diversity Approximation on Multi-Dataset
Streams [4.765131728094872]
Composable core-sets are core-sets with the property that subsets of the core set can be unioned together to obtain an approximation for the original data.
We introduce a core-set construction algorithm for constructing composable core-sets to summarize streamed data for use in active learning environments.
arXiv Detail & Related papers (2023-08-10T23:24:51Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z) - Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z) - Monotonic Cardinality Estimation of Similarity Selection: A Deep
Learning Approach [22.958342743597044]
We investigate the possibilities of utilizing deep learning for cardinality estimation of similarity selection.
We propose a novel and generic method that can be applied to any data type and distance function.
arXiv Detail & Related papers (2020-02-15T20:22:51Z) - Domain Adaptive Bootstrap Aggregating [5.444459446244819]
bootstrap aggregating, or bagging, is a popular method for improving stability of predictive algorithms.
This article proposes a domain adaptive bagging method coupled with a new iterative nearest neighbor sampler.
arXiv Detail & Related papers (2020-01-12T20:02:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.