Efficient Discovery of Significant Patterns with Few-Shot Resampling
- URL: http://arxiv.org/abs/2406.11803v1
- Date: Mon, 17 Jun 2024 17:49:27 GMT
- Title: Efficient Discovery of Significant Patterns with Few-Shot Resampling
- Authors: Leonardo Pellegrina, Fabio Vandin,
- Abstract summary: In biomedicine, basket market analysis, and social networks, the goal is to discover patterns whose association with the target is defined with respect to an underlying population.
A natural way to capture the association of a pattern with the target is to consider its statistical significance, assessing its deviation from the (null) hypothesis of independence between the pattern and the target.
We present FSR, an efficient algorithm to identify statistically significant patterns with rigorous guarantees on the probability of false discoveries.
- Score: 9.681286056736292
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Significant pattern mining is a fundamental task in mining transactional data, requiring to identify patterns significantly associated with the value of a given feature, the target. In several applications, such as biomedicine, basket market analysis, and social networks, the goal is to discover patterns whose association with the target is defined with respect to an underlying population, or process, of which the dataset represents only a collection of observations, or samples. A natural way to capture the association of a pattern with the target is to consider its statistical significance, assessing its deviation from the (null) hypothesis of independence between the pattern and the target. While several algorithms have been proposed to find statistically significant patterns, it remains a computationally demanding task, and for complex patterns such as subgroups, no efficient solution exists. We present FSR, an efficient algorithm to identify statistically significant patterns with rigorous guarantees on the probability of false discoveries. FSR builds on a novel general framework for mining significant patterns that captures some of the most commonly considered patterns, including itemsets, sequential patterns, and subgroups. FSR uses a small number of resampled datasets, obtained by assigning i.i.d. labels to each transaction, to rigorously bound the supremum deviation of a quality statistic measuring the significance of patterns. FSR builds on novel tight bounds on the supremum deviation that require to mine a small number of resampled datasets, while providing a high effectiveness in discovering significant patterns. As a test case, we consider significant subgroup mining, and our evaluation on several real datasets shows that FSR is effective in discovering significant subgroups, while requiring a small number of resampled datasets.
Related papers
- SIRST-5K: Exploring Massive Negatives Synthesis with Self-supervised
Learning for Robust Infrared Small Target Detection [53.19618419772467]
Single-frame infrared small target (SIRST) detection aims to recognize small targets from clutter backgrounds.
With the development of Transformer, the scale of SIRST models is constantly increasing.
With a rich diversity of infrared small target data, our algorithm significantly improves the model performance and convergence speed.
arXiv Detail & Related papers (2024-03-08T16:14:54Z) - Integrating Statistical Significance and Discriminative Power in Pattern
Discovery [2.1014808520898667]
Proposed methodology integrates statistical significance and discriminative power criteria into state-of-the-art algorithms.
Tests show the role of the proposed methodology in discovering patterns with pronounced improvements of discriminative power and statistical significance without quality deterioration.
arXiv Detail & Related papers (2024-01-22T14:51:01Z) - Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs)
Our proposed method first trains SOMs on unlabeled data and then a minimal number of available labeled data points are assigned to key best matching units (BMU)
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z) - Model Stealing Attack against Graph Classification with Authenticity, Uncertainty and Diversity [80.16488817177182]
GNNs are vulnerable to the model stealing attack, a nefarious endeavor geared towards duplicating the target model via query permissions.
We introduce three model stealing attacks to adapt to different actual scenarios.
arXiv Detail & Related papers (2023-12-18T05:42:31Z) - A Causality-Aware Pattern Mining Scheme for Group Activity Recognition
in a Pervasive Sensor Space [2.5486448837945765]
We propose an efficient group activity recognition scheme for HAR in a smart space.
A set of rules is leveraged to highlight causally related events in a given data stream.
A pattern-tree algorithm extracts frequent causal patterns by means of a growing tree structure.
Experiment results show that the proposed scheme performs higher recognition accuracy and with a small amount of runtime overhead.
arXiv Detail & Related papers (2023-12-01T07:54:07Z) - Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z) - A Global Model Approach to Robust Few-Shot SAR Automatic Target
Recognition [6.260916845720537]
It may not always be possible to collect hundreds of labeled samples per class for training deep learning-based SAR Automatic Target Recognition (ATR) models.
This work specifically tackles the few-shot SAR ATR problem, where only a handful of labeled samples may be available to support the task of interest.
arXiv Detail & Related papers (2023-03-20T00:24:05Z) - Task Agnostic and Post-hoc Unseen Distribution Detection [27.69612483621752]
We propose a task-agnostic and post-hoc Unseen Distribution Detection (TAPUDD) method.
It comprises of TAP-Mahalanobis, which clusters the training datasets' features and determines the minimum Mahalanobis distance of the test sample from all clusters.
We show that our method can detect unseen samples effectively across diverse tasks and performs better or on-par with the existing baselines.
arXiv Detail & Related papers (2022-07-26T17:55:15Z) - Approximate Network Motif Mining Via Graph Learning [4.2873412319680035]
Frequent and structurally related subgraphs, also known as network motifs, are valuable features of many graph datasets.
High computational complexity of identifying motif sets in arbitrary datasets has limited their use in many real-world datasets.
By automatically leveraging statistical properties of datasets, machine learning approaches have shown promise in several tasks with complexity.
arXiv Detail & Related papers (2022-06-02T12:15:05Z) - Examining and Combating Spurious Features under Distribution Shift [94.31956965507085]
We define and analyze robust and spurious representations using the information-theoretic concept of minimal sufficient statistics.
We prove that even when there is only bias of the input distribution, models can still pick up spurious features from their training data.
Inspired by our analysis, we demonstrate that group DRO can fail when groups do not directly account for various spurious correlations.
arXiv Detail & Related papers (2021-06-14T05:39:09Z) - Learning to Count in the Crowd from Limited Labeled Data [109.2954525909007]
We focus on reducing the annotation efforts by learning to count in the crowd from limited number of labeled samples.
Specifically, we propose a Gaussian Process-based iterative learning mechanism that involves estimation of pseudo-ground truth for the unlabeled data.
arXiv Detail & Related papers (2020-07-07T04:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.