A new algorithm for Subgroup Set Discovery based on Information Gain
- URL: http://arxiv.org/abs/2307.15089v2
- Date: Mon, 31 Jul 2023 08:26:12 GMT
- Title: A new algorithm for Subgroup Set Discovery based on Information Gain
- Authors: Daniel Gómez-Bravo, Aaron García, Guillermo Vigueras, Belén Ríos, Alejandro Rodríguez-González
- Abstract summary: Information Gained Subgroup Discovery (IGSD) is a new SD algorithm for pattern discovery.
We compare IGSD with two state-of-the-art SD algorithms: FSSD and SSD++.
IGSD provides better OR values than FSSD and SSD++, indicating a higher dependence between patterns and targets.
- Score: 58.720142291102135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pattern discovery is a machine learning technique that aims to find sets of
items, subsequences, or substructures that appear in a dataset with a frequency
above a manually set threshold. This process helps to identify recurring
patterns or relationships within the data, enabling valuable insights and
knowledge extraction. In this work, we propose Information Gained Subgroup
Discovery (IGSD), a new SD algorithm for pattern discovery that combines
Information Gain (IG) and Odds Ratio (OR) as multiple criteria for pattern
selection. The algorithm addresses several limitations of state-of-the-art SD
algorithms: the need to fine-tune key parameters for each dataset, the use of a
single hand-set pattern search criterion, the use of non-overlapping data
structures for exploring the subgroup space, and the impossibility of searching
for patterns while fixing some relevant dataset variables. Thus, we compare the
performance of IGSD with two
state-of-the-art SD algorithms: FSSD and SSD++. Eleven datasets are assessed
using these algorithms. For the performance evaluation, we also propose to
complement standard SD measures with IG, OR, and p-value. Obtained results show
that the FSSD and SSD++ algorithms provide less reliable patterns and smaller
sets of patterns than the IGSD algorithm for all datasets considered.
Additionally, IGSD provides better OR values than FSSD and SSD++, indicating a
higher dependence between patterns and targets. Moreover, the patterns obtained
for one of the datasets used have been validated by a group of domain experts,
and the patterns provided by IGSD show better agreement with the experts than
those obtained by FSSD and SSD++. These results demonstrate the suitability of
IGSD as a method for pattern discovery and suggest that including non-standard
SD metrics enables better evaluation of discovered patterns.
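The abstract describes IGSD's selection and evaluation criteria only at a high level. As a purely illustrative sketch, and not the paper's implementation, the snippet below shows how Information Gain, the Odds Ratio, and a Fisher p-value could be computed for one candidate pattern against a binary target; the function names, the continuity correction on the odds ratio, and the toy data are assumptions made for the example.

```python
# Illustrative sketch: scoring one candidate pattern (a boolean coverage mask
# over dataset rows) against a binary target with IG, OR, and a p-value.
import math
from scipy.stats import fisher_exact


def entropy(pos: int, neg: int) -> float:
    """Shannon entropy of a binary outcome with `pos` positives and `neg` negatives."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def score_pattern(pattern_mask, target):
    """Return (information_gain, odds_ratio, p_value) for one candidate pattern."""
    # 2x2 contingency table: covered / not covered vs. target positive / negative.
    a = sum(1 for m, t in zip(pattern_mask, target) if m and t)          # covered, positive
    b = sum(1 for m, t in zip(pattern_mask, target) if m and not t)      # covered, negative
    c = sum(1 for m, t in zip(pattern_mask, target) if not m and t)      # not covered, positive
    d = sum(1 for m, t in zip(pattern_mask, target) if not m and not t)  # not covered, negative
    n = a + b + c + d

    # Information Gain: target entropy minus the coverage-weighted entropy
    # inside and outside the pattern.
    ig = (entropy(a + c, b + d)
          - ((a + b) / n) * entropy(a, b)
          - ((c + d) / n) * entropy(c, d))

    # Odds Ratio, with a 0.5 continuity correction to avoid division by zero
    # (the correction is an assumption of this sketch, not taken from the paper).
    odds_ratio = ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))

    # Fisher's exact test p-value for the same contingency table.
    _, p_value = fisher_exact([[a, b], [c, d]])
    return ig, odds_ratio, p_value


# Toy example: a pattern that covers mostly positive rows.
mask   = [1, 1, 1, 0, 0, 0, 1, 0]
target = [1, 1, 0, 0, 0, 1, 1, 0]
print(score_pattern(mask, target))
```

A multi-criteria selection step could then retain only patterns whose IG and OR exceed chosen thresholds and whose p-value falls below a significance level, which is the kind of filtering the abstract attributes to IGSD.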
Related papers
- RHiOTS: A Framework for Evaluating Hierarchical Time Series Forecasting Algorithms [0.393259574660092]
RHiOTS is designed to assess the robustness of hierarchical time series forecasting models and algorithms on real-world datasets.
RHiOTS incorporates an innovative visualization component, turning complex, multidimensional robustness evaluation results into intuitive, easily interpretable visuals.
Our findings show that traditional statistical methods are more robust than state-of-the-art deep learning algorithms, except when the transformation effect is highly disruptive.
arXiv Detail & Related papers (2024-08-06T18:52:15Z) - ARC: A Generalist Graph Anomaly Detector with In-Context Learning [62.202323209244]
ARC is a generalist GAD approach that enables a "one-for-all" GAD model to detect anomalies across various graph datasets on-the-fly.
Equipped with in-context learning, ARC can directly extract dataset-specific patterns from the target dataset.
Extensive experiments on multiple benchmark datasets from various domains demonstrate the superior anomaly detection performance, efficiency, and generalizability of ARC.
arXiv Detail & Related papers (2024-05-27T02:42:33Z) - Integrating Statistical Significance and Discriminative Power in Pattern
Discovery [2.1014808520898667]
The proposed methodology integrates statistical significance and discriminative power criteria into state-of-the-art algorithms.
Tests show the role of the proposed methodology in discovering patterns with pronounced improvements in discriminative power and statistical significance without quality deterioration.
arXiv Detail & Related papers (2024-01-22T14:51:01Z) - Learning nonparametric DAGs with incremental information via high-order
HSIC [13.061477915002767]
We present an identifiability condition based on a determined subset of parents to identify the underlying DAG.
In the optimal phase, an optimization problem based on the first-order Hilbert-Schmidt independence criterion (HSIC) gives an estimated skeleton as the initial determined parents subset.
In the tuning phase, the skeleton is locally tuned by deletion, addition and DAG-formalization strategies.
arXiv Detail & Related papers (2023-08-11T07:07:21Z) - Interpretable Out-Of-Distribution Detection Using Pattern Identification [0.0]
Out-of-distribution (OoD) detection for data-based programs is a goal of paramount importance.
Common approaches in the literature tend to train detectors that require both in-distribution (IoD) and OoD validation samples.
We propose to use existing work from the field of explainable AI, namely the PARTICUL pattern identification algorithm, in order to build more interpretable and robust OoD detectors.
arXiv Detail & Related papers (2023-01-24T15:35:54Z) - Learning to Hash Robustly, with Guarantees [79.68057056103014]
In this paper, we design an NNS algorithm for the Hamming space that has worst-case guarantees essentially matching that of theoretical algorithms.
We evaluate the algorithm's ability to optimize for a given dataset both theoretically and practically.
Our algorithm achieves 1.8x and 2.1x better recall on the worst-performing queries for the MNIST and ImageNet datasets, respectively.
arXiv Detail & Related papers (2021-08-11T20:21:30Z) - Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z) - A Systematic Characterization of Sampling Algorithms for Open-ended
Language Generation [71.31905141672529]
We study the widely adopted ancestral sampling algorithms for auto-regressive language models.
We identify three key properties that are shared among them: entropy reduction, order preservation, and slope preservation.
We find that the set of sampling algorithms that satisfies these properties performs on par with the existing sampling algorithms.
arXiv Detail & Related papers (2020-09-15T17:28:42Z) - The Data Representativeness Criterion: Predicting the Performance of
Supervised Classification Based on Data Set Similarity [4.934817254755008]
We propose the Data Representativeness Criterion (DRC) to determine how representative a training data set is of a new unseen data set.
We present a proof of principle, to see whether the DRC can quantify the similarity of data sets and whether the DRC relates to the performance of a supervised classification algorithm.
arXiv Detail & Related papers (2020-02-27T15:08:13Z) - CONSAC: Robust Multi-Model Fitting by Conditional Sample Consensus [62.86856923633923]
We present a robust estimator for fitting multiple parametric models of the same form to noisy measurements.
In contrast to previous works, which resorted to hand-crafted search strategies for multiple model detection, we learn the search strategy from data.
For self-supervised learning of the search, we evaluate the proposed algorithm on multi-homography estimation and demonstrate an accuracy that is superior to state-of-the-art methods.
arXiv Detail & Related papers (2020-01-08T17:37:01Z)