The Data Representativeness Criterion: Predicting the Performance of
Supervised Classification Based on Data Set Similarity
- URL: http://arxiv.org/abs/2002.12105v1
- Date: Thu, 27 Feb 2020 15:08:13 GMT
- Title: The Data Representativeness Criterion: Predicting the Performance of
Supervised Classification Based on Data Set Similarity
- Authors: Evelien Schat, Rens van de Schoot, Wouter M. Kouw, Duco Veen,
Adri\"enne M. Mendrik
- Abstract summary: We propose the Data Representativeness Criterion (DRC) to determine how representative a training data set is of a new unseen data set.
We present a proof of principle to see whether the DRC can quantify the similarity of data sets and whether the DRC relates to the performance of a supervised classification algorithm.
- Score: 4.934817254755008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In a broad range of fields it may be desirable to reuse a supervised
classification algorithm and apply it to a new data set. However,
generalization of such an algorithm, and thus achieving a similar
classification performance, is only possible when the training data used to
build the algorithm are similar to the new unseen data one wishes to apply it
to. It is often unknown in advance how an algorithm will perform on new unseen
data, which is a crucial reason for not deploying an algorithm at all.
Therefore, tools are needed to
measure the similarity of data sets. In this paper, we propose the Data
Representativeness Criterion (DRC) to determine how representative a training
data set is of a new unseen data set. We present a proof of principle to see
whether the DRC can quantify the similarity of data sets and whether the DRC
relates to the performance of a supervised classification algorithm. We
compared a number of magnetic resonance imaging (MRI) data sets, ranging from
subtle to severe differences in acquisition parameters. Results indicate that,
based on the similarity of data sets, the DRC is able to give an indication as
to when the performance of a supervised classifier decreases. The strictness of
the DRC can be set by the user, depending on what one considers to be an
acceptable underperformance.
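
The DRC itself is defined in the paper; as a loose, hypothetical illustration of the underlying idea, the sketch below measures how distinguishable two data sets are with a domain classifier, a common proxy for data set similarity. The function name, feature sizes, and threshold value are assumptions, not the paper's method.

```python
# Hypothetical sketch, not the DRC as defined in the paper: quantify how
# distinguishable a training set is from a new unseen set with a domain
# classifier. Accuracy near 0.5 means the sets look alike; near 1.0 means
# they are easily told apart. `threshold` stands in for the user-set
# strictness level mentioned in the abstract.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def domain_separability(train_X, new_X, threshold=0.6):
    X = np.vstack([train_X, new_X])
    y = np.concatenate([np.zeros(len(train_X)), np.ones(len(new_X))])
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                          cv=5, scoring="accuracy").mean()
    return acc, acc <= threshold  # True: new data deemed representative

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(200, 10))
new = rng.normal(0.3, 1.0, size=(200, 10))  # subtle acquisition shift
print(domain_separability(train, new))
```

Accuracy near chance suggests the training set is representative of the new data; accuracy near 1.0 signals the kind of shift under which classifier performance can be expected to drop.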
Related papers
- The importance of the clustering model to detect new types of intrusion in data traffic [0.0]
The presented work uses the K-means algorithm, a popular clustering technique.
Data were gathered using the Kali Linux environment, CICFlowMeter traffic capture, and the PuTTY software tools.
The model counted the attacks and assigned a number to each of them.
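
A minimal sketch of the clustering step described above, with synthetic flow features standing in for the CICFlowMeter output (the feature values and cluster count are assumptions):

```python
# Toy K-means run on synthetic flow features; real flows would come from
# CICFlowMeter, and the number of clusters would be tuned to the traffic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
flows = np.vstack([rng.normal(0, 1, (500, 4)),    # benign-like traffic
                   rng.normal(5, 1, (50, 4))])    # attack-like traffic
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(flows))
# Count flows per cluster, mirroring "counted the attacks and numbered them".
print(dict(zip(*np.unique(labels, return_counts=True))))
```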
arXiv Detail & Related papers (2024-11-21T19:40:31Z) - DREW : Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking [58.37644304554906]
We propose Data Retrieval with Error-corrected codes and Watermarking (DREW).
DREW randomly clusters the reference dataset and injects unique error-controlled watermark keys into each cluster.
After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches.
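
As a rough sketch of the two-stage retrieval described above, the toy below clusters a reference set of embeddings and then searches only within the located cluster; the watermarking and error-controlled decoding are omitted, so a plain nearest-centroid lookup stands in for key decoding (an assumption):

```python
# Stage 1: locate the relevant cluster; stage 2: similarity search inside it.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
reference = rng.normal(size=(1000, 64))           # reference embeddings
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(reference)

def retrieve(query, k=5):
    cluster = km.predict(query[None, :])[0]       # stage 1: locate cluster
    members = np.flatnonzero(km.labels_ == cluster)
    sims = reference[members] @ query             # stage 2: similarity search
    return members[np.argsort(-sims)[:k]]

print(retrieve(reference[3]))
```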
arXiv Detail & Related papers (2024-06-05T01:19:44Z) - Spectral Clustering of Categorical and Mixed-type Data via Extra Graph
Nodes [0.0]
This paper explores a more natural way to incorporate both numerical and categorical information into the spectral clustering algorithm.
We propose adding extra nodes corresponding to the different categories the data may belong to and show that it leads to an interpretable clustering objective function.
We demonstrate that this simple framework leads to a linear-time spectral clustering algorithm for categorical-only data.
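
A hedged sketch of the extra-node idea, assuming one added node per category level with unit-weight edges (the paper's exact graph construction and weighting may differ):

```python
# Augment the similarity graph with one node per category level, then run
# standard spectral clustering on the augmented graph.
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans

X_num = np.random.default_rng(3).normal(size=(6, 2))   # numerical features
X_cat = np.array([0, 0, 1, 1, 2, 2])                   # one categorical column
n, n_cat = len(X_num), X_cat.max() + 1

A = np.zeros((n + n_cat, n + n_cat))
d2 = ((X_num[:, None] - X_num[None]) ** 2).sum(-1)
A[:n, :n] = np.exp(-d2)                                # numeric similarities
for i, c in enumerate(X_cat):                          # data <-> category edges
    A[i, n + c] = A[n + c, i] = 1.0
np.fill_diagonal(A, 0.0)                               # no self-loops

L = laplacian(A, normed=True)
vals, vecs = np.linalg.eigh(L)                         # ascending eigenvalues
embedding = vecs[:n, 1:3]                              # skip trivial eigenvector
print(KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding))
```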
arXiv Detail & Related papers (2024-03-08T20:49:49Z) - Composable Core-sets for Diversity Approximation on Multi-Dataset
Streams [4.765131728094872]
Composable core-sets are core-sets with the property that subsets of the core-set can be unioned together to obtain an approximation of the original data.
We introduce a core-set construction algorithm for constructing composable core-sets to summarize streamed data for use in active learning environments.
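
To illustrate composability, the toy below builds a small diversity core-set per stream chunk with farthest-point traversal (one plausible diversity heuristic, not necessarily the paper's construction) and unions the per-chunk summaries:

```python
# Per-chunk core-sets are unioned to summarize the whole stream.
import numpy as np

def farthest_point_coreset(X, k):
    idx = [0]                                   # seed with the first point
    d = np.linalg.norm(X - X[0], axis=1)
    for _ in range(k - 1):
        idx.append(int(d.argmax()))             # add the farthest point
        d = np.minimum(d, np.linalg.norm(X - X[idx[-1]], axis=1))
    return X[idx]

rng = np.random.default_rng(4)
stream_chunks = [rng.normal(size=(200, 3)) for _ in range(5)]
summary = np.vstack([farthest_point_coreset(c, k=8) for c in stream_chunks])
print(summary.shape)   # union of per-chunk core-sets: (40, 3)
```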
arXiv Detail & Related papers (2023-08-10T23:24:51Z) - A new algorithm for Subgroup Set Discovery based on Information Gain [58.720142291102135]
Information Gained Subgroup Discovery (IGSD) is a new SD algorithm for pattern discovery.
We compare IGSD with two state-of-the-art SD algorithms: FSSD and SSD++.
IGSD provides better OR values than FSSD and SSD++, indicating a higher dependence between patterns and targets.
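
As a toy illustration of the quantities this comparison relies on, the sketch below computes the information gain of a candidate pattern with respect to a binary target, and the odds ratio (OR) used above to measure pattern-target dependence; the pattern language and search procedure of IGSD itself are not shown:

```python
# `covered` marks which instances a candidate pattern matches.
import numpy as np

def entropy(y):
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def information_gain(y, covered):
    h = entropy(y)                       # target entropy before the split
    for part in (covered, ~covered):
        if part.any():
            h -= part.mean() * entropy(y[part])
    return h

def odds_ratio(y, covered):
    a = (covered & (y == 1)).sum(); b = (covered & (y == 0)).sum()
    c = (~covered & (y == 1)).sum(); d = (~covered & (y == 0)).sum()
    return (a * d) / max(b * c, 1)       # guard against division by zero

y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
covered = np.array([True, True, True, False, False, False, True, True])
print(information_gain(y, covered), odds_ratio(y, covered))
```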
arXiv Detail & Related papers (2023-07-26T21:42:34Z) - Riemannian classification of EEG signals with missing values [67.90148548467762]
This paper proposes two strategies to handle missing data for the classification of electroencephalograms.
The first approach estimates the covariance from imputed data with the $k$-nearest neighbors algorithm; the second relies on the observed data by leveraging the observed-data likelihood within an expectation-maximization algorithm.
As the results show, the proposed strategies perform better than classification based on observed data alone and maintain a high accuracy even when the missing data ratio increases.
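
A minimal sketch of the first strategy only, assuming k-nearest-neighbor imputation followed by covariance estimation (the Riemannian classifier itself is omitted):

```python
# Impute missing EEG values with kNN, then estimate the covariance matrix
# that would serve as the Riemannian feature.
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(7)
eeg = rng.normal(size=(100, 8))                    # samples x channels
mask = rng.random(eeg.shape) < 0.1
eeg[mask] = np.nan                                 # knock out 10% of values

imputed = KNNImputer(n_neighbors=5).fit_transform(eeg)
cov = np.cov(imputed, rowvar=False)                # 8 x 8 covariance feature
print(cov.shape)
```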
arXiv Detail & Related papers (2021-10-19T14:24:50Z) - Auditing for Diversity using Representative Examples [17.016881905579044]
We propose a cost-effective approach to approximate the disparity of a given unlabeled dataset.
Our proposed algorithm uses the pairwise similarity between elements in the dataset and elements in the control set to effectively bootstrap an approximation.
We show that using a control set whose size is much smaller than the size of the dataset is sufficient to achieve a small approximation error.
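
A hypothetical sketch of the bootstrap idea: each unlabeled element borrows the group label of its most similar control-set element, and the data set's group proportions are approximated from those borrowed labels (the dot-product similarity and set sizes are assumptions):

```python
# Approximate group proportions of an unlabeled dataset via a small,
# labeled control set and pairwise similarities.
import numpy as np

rng = np.random.default_rng(5)
dataset = rng.normal(size=(1000, 16))              # unlabeled elements
control = rng.normal(size=(30, 16))                # small labeled control set
control_groups = rng.integers(0, 2, size=30)       # known group per element

sims = dataset @ control.T                         # pairwise similarities
nearest = sims.argmax(axis=1)                      # most similar control item
approx = np.bincount(control_groups[nearest], minlength=2) / len(dataset)
print("approximate group proportions:", approx)
```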
arXiv Detail & Related papers (2021-07-15T15:21:17Z) - Self-Trained One-class Classification for Unsupervised Anomaly Detection [56.35424872736276]
Anomaly detection (AD) has various applications across domains, from manufacturing to healthcare.
In this work, we focus on unsupervised AD problems whose entire training data are unlabeled and may contain both normal and anomalous samples.
To tackle this problem, we build a robust one-class classification framework via data refinement.
We show that our method outperforms a state-of-the-art one-class classification method by 6.3 AUC points and 12.5 average precision points.
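
A rough sketch of one-class classification with data refinement; a OneClassSVM and a simple quantile-based filtering rule stand in for the paper's refinement procedure, which is more elaborate:

```python
# Fit, drop the lowest-scoring (most anomalous-looking) training points,
# and refit on the refined set.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(6)
train = np.vstack([rng.normal(0, 1, (450, 5)),   # mostly normal samples
                   rng.normal(6, 1, (50, 5))])   # contaminating anomalies

model = OneClassSVM(nu=0.1).fit(train)
for _ in range(3):                               # self-training refinement
    scores = model.decision_function(train)
    train = train[scores > np.quantile(scores, 0.1)]  # drop likely anomalies
    model = OneClassSVM(nu=0.05).fit(train)

test_anomalies = rng.normal(6, 1, (20, 5))
print((model.predict(test_anomalies) == -1).mean())   # fraction flagged
```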
arXiv Detail & Related papers (2021-06-11T01:36:08Z) - DAC: Deep Autoencoder-based Clustering, a General Deep Learning
Framework of Representation Learning [0.0]
We propose DAC, Deep Autoencoder-based Clustering, a data-driven framework to learn clustering representations using deep neural networks.
Experimental results show that our approach can effectively boost the performance of the KMeans clustering algorithm on a variety of datasets.
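
A minimal sketch of the pipeline described above (architecture sizes, optimizer, and epoch count are arbitrary): learn a latent representation with an autoencoder, then run KMeans on the encoded data:

```python
# Train an autoencoder for reconstruction, then cluster in the latent space.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

X = torch.randn(512, 20)
enc = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 2))
dec = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 20))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), 1e-2)

for _ in range(200):                       # reconstruction training loop
    opt.zero_grad()
    loss = nn.functional.mse_loss(dec(enc(X)), X)
    loss.backward()
    opt.step()

with torch.no_grad():
    z = enc(X).numpy()                     # cluster in the learned space
print(KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(z)[:10])
```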
arXiv Detail & Related papers (2021-02-15T11:31:00Z) - Dual Adversarial Auto-Encoders for Clustering [152.84443014554745]
We propose Dual Adversarial Auto-encoder (Dual-AAE) for unsupervised clustering.
By performing variational inference on the objective function of Dual-AAE, we derive a new reconstruction loss which can be optimized by training a pair of Auto-encoders.
Experiments on four benchmarks show that Dual-AAE achieves superior performance over state-of-the-art clustering methods.
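
Skeleton only, as a loose illustration: a pair of auto-encoders trained jointly on a summed reconstruction loss. The adversarial discriminators and the variational derivation central to Dual-AAE are deliberately omitted, so this shows the structure, not the method:

```python
# Two auto-encoders optimized together on a joint reconstruction objective.
import torch
import torch.nn as nn

def autoencoder(d, h=16, z=4):
    return (nn.Sequential(nn.Linear(d, h), nn.ReLU(), nn.Linear(h, z)),
            nn.Sequential(nn.Linear(z, h), nn.ReLU(), nn.Linear(h, d)))

X = torch.randn(256, 10)
(enc1, dec1), (enc2, dec2) = autoencoder(10), autoencoder(10)
params = [p for m in (enc1, dec1, enc2, dec2) for p in m.parameters()]
opt = torch.optim.Adam(params, 1e-2)

for _ in range(100):
    opt.zero_grad()
    loss = (nn.functional.mse_loss(dec1(enc1(X)), X)
            + nn.functional.mse_loss(dec2(enc2(X)), X))
    loss.backward()
    opt.step()
print(float(loss))
```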
arXiv Detail & Related papers (2020-08-23T13:16:34Z) - New advances in enumerative biclustering algorithms with online
partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm, called RIn-Close_CVC3, keeps the attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
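
As a toy helper for the bicluster type being enumerated, the check below tests whether a given submatrix has (near-)constant values on each column; it is a naive verification, not the RIn-Close enumeration itself:

```python
# Check the "constant values on columns" bicluster condition for a
# (rows, cols) submatrix, up to a residue threshold eps.
import numpy as np

def is_cvc_bicluster(M, rows, cols, eps=1e-9):
    sub = M[np.ix_(rows, cols)]
    col_range = sub.max(axis=0) - sub.min(axis=0)  # per-column value spread
    return bool((col_range <= eps).all())

M = np.array([[1.0, 5.0, 9.0],
              [1.0, 5.0, 2.0],
              [3.0, 5.0, 2.0]])
print(is_cvc_bicluster(M, [0, 1], [0, 1]))        # True: constant columns
print(is_cvc_bicluster(M, [0, 2], [0, 2]))        # False
```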
arXiv Detail & Related papers (2020-03-07T14:54:26Z)