A model-free subdata selection method for classification
- URL: http://arxiv.org/abs/2404.19127v1
- Date: Mon, 29 Apr 2024 22:05:29 GMT
- Title: A model-free subdata selection method for classification
- Authors: Rakhi Singh
- Abstract summary: Subdata selection is the study of methods that select a small, representative sample of big data.
We propose a model-free subdata selection method for classification problems.
We show analytically that the PED subdata results in a smaller Gini impurity than uniform subdata.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Subdata selection is the study of methods that select a small, representative sample of big data, the analysis of which is fast and statistically efficient. Existing subdata selection methods assume that the big data can be reasonably modeled by an underlying model, such as a (multinomial) logistic regression for classification problems. These methods work extremely well when the underlying modeling assumption is correct but often yield poor results otherwise. In this paper, we propose a model-free subdata selection method for classification problems, and the resulting subdata is called PED subdata. The PED subdata uses decision trees to find a partition of the data, followed by selecting an appropriate sample from each component of the partition. Random forests are used for analyzing the selected subdata. Our method can be employed for a general number of classes in the response and for both categorical and continuous predictors. We show analytically that the PED subdata results in a smaller Gini impurity than uniform subdata. Further, through extensive simulated and real datasets, we demonstrate that the PED subdata has higher classification accuracy than other competing methods.
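The described pipeline is easy to sketch. Below is a minimal, hedged illustration, not the paper's exact PED algorithm: a single decision tree induces the partition through its leaves, a proportional random sample is drawn from each leaf (the paper's within-component selection rule is more refined), and a random forest analyzes the subdata. The function name `ped_style_subdata`, the synthetic data, and all tuning values are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def ped_style_subdata(X, y, n_sub, rng=np.random.default_rng(0)):
    """Sketch of tree-partition-based subdata selection.

    A shallow decision tree partitions the data into leaves; a
    proportional random sample is then drawn from every leaf so that
    each component of the partition is represented in the subdata.
    """
    tree = DecisionTreeClassifier(max_leaf_nodes=32, random_state=0).fit(X, y)
    leaf_ids = tree.apply(X)                      # partition component of each row
    chosen = []
    for leaf in np.unique(leaf_ids):
        members = np.flatnonzero(leaf_ids == leaf)
        k = max(1, round(n_sub * len(members) / len(X)))
        chosen.append(rng.choice(members, size=min(k, len(members)), replace=False))
    idx = np.concatenate(chosen)
    return X[idx], y[idx]

# Analyze the selected subdata with a random forest, as in the paper.
X = np.random.default_rng(1).normal(size=(100_000, 10))
y = (X[:, 0] * X[:, 1] > 0).astype(int)           # deliberately non-logistic truth
X_sub, y_sub = ped_style_subdata(X, y, n_sub=1000)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_sub, y_sub)
print(rf.score(X[:5000], y[:5000]))
```

The non-logistic ground truth in the example is exactly the regime where model-based subdata selection can fail, which is what motivates a model-free approach.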
Related papers
- Data Pruning in Generative Diffusion Models [2.0111637969968]
Generative models aim to estimate the underlying distribution of the data, so presumably they should benefit from larger datasets.
We show that eliminating redundant or noisy data from large datasets is beneficial, particularly when done strategically.
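As one hedged illustration of "strategic" pruning, not the paper's criterion, a simple proxy is to drop near-duplicate points, i.e. those whose nearest neighbor is closest; the function name and the keep fraction are assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def prune_near_duplicates(X, keep_fraction=0.8):
    """Drop the points that are most redundant under a nearest-neighbor proxy."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    dist, _ = nn.kneighbors(X)            # dist[:, 0] is self (0), dist[:, 1] is the NN
    order = np.argsort(dist[:, 1])        # smallest NN distance = most redundant
    n_keep = int(keep_fraction * len(X))
    return X[np.sort(order[len(X) - n_keep:])]

X = np.random.default_rng(0).normal(size=(5000, 16))
X_pruned = prune_near_duplicates(X, keep_fraction=0.8)
print(X_pruned.shape)
```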
arXiv Detail & Related papers (2024-11-19T14:13:25Z)
- Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data and then assigns a minimal number of available labeled data points to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
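A hedged sketch of the BMU-assignment idea follows, using the third-party minisom package rather than the authors' code; the grid size, training length, and synthetic labels are assumptions, and the same scheme extends to regression by storing numeric values at BMUs:

```python
import numpy as np
from minisom import MiniSom   # pip install minisom

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(2000, 8))
X_few, y_few = X_unlabeled[:20], rng.integers(0, 2, size=20)  # tiny labeled set

# 1) Train the SOM on unlabeled data only.
som = MiniSom(10, 10, input_len=8, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X_unlabeled, num_iteration=5000)

# 2) Assign the few available labels to their best matching units (BMUs).
bmu_label = {som.winner(x): y for x, y in zip(X_few, y_few)}

# 3) Predict via the label of the nearest labeled BMU on the SOM grid.
def predict(x):
    i, j = som.winner(x)
    return min(bmu_label.items(),
               key=lambda kv: (kv[0][0] - i) ** 2 + (kv[0][1] - j) ** 2)[1]

print(predict(X_unlabeled[100]))
```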
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- A Consistent and Scalable Algorithm for Best Subset Selection in Single Index Models [1.3236116985407258]
Best subset selection in high-dimensional models is known to be computationally intractable.
We propose the first provably scalable algorithm for best subset selection in high-dimensional SIMs.
Our algorithm enjoys the subset selection consistency and has the oracle property with a high probability.
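For context, the brute-force baseline that a provably scalable algorithm must avoid looks as follows; this exhaustive search costs O(p choose k) model fits and is feasible only for small p, and it uses plain least squares rather than the paper's single index models:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression

def best_subset(X, y, k):
    """Exhaustive best-subset search: O(p choose k) fits, so small p only."""
    best_score, best_vars = -np.inf, None
    for vars_ in combinations(range(X.shape[1]), k):
        score = LinearRegression().fit(X[:, vars_], y).score(X[:, vars_], y)
        if score > best_score:
            best_score, best_vars = score, vars_
    return best_vars

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 2] - 2 * X[:, 7] + 0.1 * rng.normal(size=200)
print(best_subset(X, y, k=2))   # expect (2, 7)
```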
arXiv Detail & Related papers (2023-09-12T13:48:06Z)
- Finding Meaningful Distributions of ML Black-boxes under Forensic Investigation [25.79728190384834]
Given a poorly documented neural network model, we take the perspective of a forensic investigator who wants to find out the model's data domain.
We propose solving this problem by leveraging a comprehensive corpus, such as ImageNet, to select a meaningful distribution.
Our goal is to select a set of samples from the corpus for the given model.
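One crude proxy for this selection step, not the paper's method, is to keep the corpus samples on which the black-box is most confident; the function name and the entropy criterion are assumptions:

```python
import numpy as np

def select_confident_samples(model_predict_proba, corpus, n_keep):
    """Keep corpus samples with the lowest predictive entropy.

    `model_predict_proba` is any callable returning class probabilities;
    confident (low-entropy) predictions suggest a sample lies inside the
    black-box model's data domain.
    """
    probs = model_predict_proba(corpus)                     # shape (n, n_classes)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return corpus[np.argsort(entropy)[:n_keep]]

# Usage with any fitted scikit-learn classifier `clf` and corpus array `corpus`:
# meaningful = select_confident_samples(clf.predict_proba, corpus, n_keep=500)
```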
arXiv Detail & Related papers (2023-05-10T03:25:23Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performance comparable to that of a logistic model trained with the full unaggregated data.
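A rough stand-in for the idea, explicitly not the paper's estimator: treat per-group means and covariances as the only observed statistics, take the Gaussian, the maximum entropy distribution under those moment constraints, as the feature model, draw pseudo-samples, and fit a logistic regression on them. The group statistics and label-assignment rule below are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def fit_from_aggregates(group_stats):
    """group_stats: list of (mean, cov, n, positive_rate) per aggregation group."""
    Xs, ys = [], []
    for mean, cov, n, pos_rate in group_stats:
        # Gaussian = max-entropy density matching the observed first two moments.
        Xs.append(rng.multivariate_normal(mean, cov, size=n))
        # Crude labels: match only the group's aggregate positive rate.
        ys.append((rng.random(n) < pos_rate).astype(int))
    return LogisticRegression().fit(np.vstack(Xs), np.concatenate(ys))

stats = [(np.zeros(3), np.eye(3), 500, 0.2), (np.ones(3), np.eye(3), 500, 0.7)]
model = fit_from_aggregates(stats)
print(model.coef_)
```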
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
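Minimal numpy versions of the two strategies follow; libraries such as imbalanced-learn provide more refined variants (e.g. SMOTE), and the function names here are illustrative:

```python
import numpy as np

def random_oversample(X, y, rng=np.random.default_rng(0)):
    """Duplicate minority-class rows until all classes have equal counts."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_max, replace=True)
        for c in classes
    ])
    return X[idx], y[idx]

def random_undersample(X, y, rng=np.random.default_rng(0)):
    """Drop majority-class rows until all classes have equal counts."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[idx], y[idx]
```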
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Sampling from Arbitrary Functions via PSD Models [55.41644538483948]
We take a two-step approach by first modeling the probability distribution and then sampling from that model.
We show that these models can approximate a large class of densities concisely using few evaluations, and present a simple algorithm to effectively sample from these models.
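The paper's density models are PSD (positive semi-definite) models; the sketch below substitutes a kernel density estimate, purely to show the same two-step pattern of model-then-sample, and the synthetic mixture is an assumption:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(3, 1.0, 500)])

# Step 1: model the probability distribution (KDE here; PSD models in the paper).
density = gaussian_kde(data)

# Step 2: sample from the fitted model instead of the raw data.
new_samples = density.resample(1000).ravel()
print(new_samples.mean(), new_samples.std())
```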
arXiv Detail & Related papers (2021-10-20T12:25:22Z)
- Self-Trained One-class Classification for Unsupervised Anomaly Detection [56.35424872736276]
Anomaly detection (AD) has various applications across domains, from manufacturing to healthcare.
In this work, we focus on unsupervised AD problems whose entire training data are unlabeled and may contain both normal and anomalous samples.
To tackle this problem, we build a robust one-class classification framework via data refinement.
We show that our method outperforms a state-of-the-art one-class classification method by 6.3 AUC points and 12.5 average precision points.
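A simplified refinement loop in the same spirit, not the paper's framework: fit a one-class model, discard the most anomalous-looking training points, and refit. The detector choice (IsolationForest), round count, and drop fraction are assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def refine_and_fit(X, n_rounds=3, drop_fraction=0.05):
    """Iteratively remove the most anomalous-looking points, then refit.

    Refinement pushes the training set toward 'normal' samples even when
    the unlabeled data contains anomalies.
    """
    for _ in range(n_rounds):
        model = IsolationForest(random_state=0).fit(X)
        scores = model.score_samples(X)               # lower = more anomalous
        keep = scores >= np.quantile(scores, drop_fraction)
        X = X[keep]
    return IsolationForest(random_state=0).fit(X)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(950, 4)), rng.normal(6, 1, size=(50, 4))])
detector = refine_and_fit(X)
print(detector.predict(X[:5]))    # +1 = normal, -1 = anomalous
```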
arXiv Detail & Related papers (2021-06-11T01:36:08Z)
- Hierarchical Representation via Message Propagation for Robust Model Fitting [28.03005930782681]
We propose a novel hierarchical representation via message propagation (HRMP) method for robust model fitting.
We formulate the consensus information and the preference information as a hierarchical representation to alleviate the sensitivity to gross outliers.
The proposed HRMP can not only accurately estimate the number and parameters of multiple model instances, but also handle multi-structural data contaminated with a large number of outliers.
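HRMP itself handles multiple model instances via message propagation; the snippet below shows only the classic single-instance robust-fitting baseline (RANSAC) that such methods generalize, with synthetic data and thresholds chosen as assumptions:

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor, LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200).reshape(-1, 1)
y = 3 * x.ravel() + 1 + rng.normal(0, 0.3, 200)
y[:60] = rng.uniform(-30, 30, 60)                 # 30% gross outliers

# RANSAC fits on random consensus sets, so gross outliers barely bias it.
ransac = RANSACRegressor(LinearRegression(), residual_threshold=1.0,
                         random_state=0).fit(x, y)
print(ransac.estimator_.coef_, ransac.estimator_.intercept_)  # ~3, ~1
```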
arXiv Detail & Related papers (2020-12-29T04:14:19Z)
- The Gap on GAP: Tackling the Problem of Differing Data Distributions in Bias-Measuring Datasets [58.53269361115974]
Diagnostic datasets that can detect biased models are an important prerequisite for bias reduction within natural language processing.
However, undesired patterns in the collected data can make such tests incorrect.
We introduce a theoretically grounded method for weighting test samples to cope with such patterns in the test data.
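A generic version of the reweighting idea (the paper derives its weights from a theoretical analysis): weight each test sample by the ratio of the desired to the observed frequency of a confounding attribute. The attribute, target distribution, and function name below are assumptions:

```python
import numpy as np

def balance_weights(attribute, target_dist):
    """Importance weights that make a skewed test set match a target distribution.

    `attribute` holds a discrete confounder per test sample (e.g. gender in a
    coreference test set); `target_dist` maps each value to its desired frequency.
    """
    values, counts = np.unique(attribute, return_counts=True)
    observed = dict(zip(values, counts / len(attribute)))
    return np.array([target_dist[a] / observed[a] for a in attribute])

attr = np.array(["a"] * 80 + ["b"] * 20)          # skewed 80/20 test set
w = balance_weights(attr, {"a": 0.5, "b": 0.5})   # desired 50/50 split
correct = np.ones(100)                            # per-sample 0/1 correctness
print(np.average(correct, weights=w))             # weighted accuracy
```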
arXiv Detail & Related papers (2020-11-03T16:50:13Z)
- Self-Representation Based Unsupervised Exemplar Selection in a Union of Subspaces [27.22427926657327]
We present a new exemplar selection model that searches for a subset that best reconstructs all data points as measured by the $\ell_1$ norm of the representation coefficients.
When the dataset is drawn from a union of independent subspaces, our method is able to select sufficiently many representatives from each subspace.
We also develop an exemplar based subspace clustering method that is robust to imbalanced data and efficient for large scale data.
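A small-scale sketch of the self-representation idea: compute a sparse ($\ell_1$-regularized) coefficient matrix that reconstructs every point from the others, then rank points by how often they are used. The paper's optimization program and its scalable solver differ in the details, and the penalty strength here is an assumption:

```python
import numpy as np
from sklearn.linear_model import Lasso

def exemplar_scores(X, alpha=0.05):
    """Score each point by its total use in sparse self-representations."""
    n = len(X)
    C = np.zeros((n, n))
    for j in range(n):
        others = np.delete(np.arange(n), j)
        # Sparse code of point j in terms of all other points (l1 penalty).
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
        C[others, j] = lasso.fit(X[others].T, X[j]).coef_
    return np.abs(C).sum(axis=1)        # heavily reused points score high

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
scores = exemplar_scores(X)
exemplars = np.argsort(scores)[-5:]     # keep the five strongest exemplars
print(exemplars)
```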
arXiv Detail & Related papers (2020-06-07T19:43:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.