Statistical Inference for Clustering-based Anomaly Detection
- URL: http://arxiv.org/abs/2504.18633v1
- Date: Fri, 25 Apr 2025 18:21:26 GMT
- Title: Statistical Inference for Clustering-based Anomaly Detection
- Authors: Nguyen Thi Minh Phu, Duong Tan Loc, Vo Nguyen Le Duy
- Abstract summary: Unsupervised anomaly detection is a fundamental problem in machine learning and statistics. We propose SI-CLAD (Statistical Inference for CLustering-based Anomaly Detection), a novel statistical framework for testing the clustering-based AD results.
- Score: 7.10052009802944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised anomaly detection (AD) is a fundamental problem in machine learning and statistics. A popular approach to unsupervised AD is clustering-based detection. However, this method lacks the ability to guarantee the reliability of the detected anomalies. In this paper, we propose SI-CLAD (Statistical Inference for CLustering-based Anomaly Detection), a novel statistical framework for testing the clustering-based AD results. The key strength of SI-CLAD lies in its ability to rigorously control the probability of falsely identifying anomalies, maintaining it below a pre-specified significance level $\alpha$ (e.g., $\alpha = 0.05$). By analyzing the selection mechanism inherent in clustering-based AD and leveraging the Selective Inference (SI) framework, we prove that false detection control is attainable. Moreover, we introduce a strategy to boost the true detection rate, enhancing the overall performance of SI-CLAD. Extensive experiments on synthetic and real-world datasets provide strong empirical support for our theoretical findings, showcasing the superior performance of the proposed method.
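The abstract's core idea can be illustrated with a toy sketch. Below, anomalies are flagged by distance from the sample center, and each flagged point then gets a naive two-sided z-test p-value against a standard-normal null. This is exactly the naive pipeline that SI-CLAD improves on: because the points were selected *for being extreme*, these naive p-values are invalid, and the paper's contribution is to condition the test on the selection event so the false-detection probability stays below $\alpha$. The code is a minimal illustration of the setup, not the paper's actual procedure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)  # toy 1-D data, all points are true inliers

# Step 1 (selection): flag points more than 2 units from the sample mean.
center = x.mean()
flagged = np.abs(x - center) > 2.0

# Step 2 (naive inference): two-sided z-test for each flagged point.
# Ignoring the selection step inflates significance; selective inference
# (as in SI-CLAD) would instead compute a p-value conditional on the
# event that the point was flagged.
alpha = 0.05
p_naive = 2.0 * (1.0 - stats.norm.cdf(np.abs(x[flagged] - center)))
print("flagged:", int(flagged.sum()),
      "naively significant at alpha=0.05:", int((p_naive < alpha).sum()))
```

Even though every point here is drawn from the null, the naive test tends to declare most flagged points significant, which is the false-detection problem the selective-inference framework addresses.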
Related papers
- Unsupervised Clustering Approaches for Autism Screening: Achieving 95.31% Accuracy with a Gaussian Mixture Model [0.0]
Autism spectrum disorder (ASD) remains a challenging condition to diagnose effectively and promptly.
Traditional diagnostic methods presuppose the availability of labeled data, which can be both time-consuming and resource-intensive to obtain.
This paper explores the use of four distinct unsupervised clustering algorithms to analyze a publicly available dataset of 704 adult individuals screened for ASD.
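A Gaussian mixture model is one of the four clustering approaches this entry refers to. The sketch below fits a two-component GMM with scikit-learn on synthetic stand-in features (the actual 704-subject ASD screening dataset is not reproduced here; the group means, dimensionality, and sample sizes are all illustrative assumptions).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for screening features: two loose groups in 5-D space.
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, size=(100, 5))
group_b = rng.normal(3.0, 1.0, size=(100, 5))
X = np.vstack([group_a, group_b])

# Fit a two-component Gaussian mixture (unsupervised: no labels used)
# and read off the soft-clustering assignments.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)
```

In a screening setting, the discovered components would then be compared post hoc against clinical outcomes to judge whether the unsupervised partition tracks the diagnosis.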
arXiv Detail & Related papers (2025-02-20T18:12:59Z) - A Hybrid Framework for Statistical Feature Selection and Image-Based Noise-Defect Detection [55.2480439325792]
This paper presents a hybrid framework that integrates both statistical feature selection and classification techniques to improve defect detection accuracy. We present around 55 distinguished features that are extracted from industrial images, which are then analyzed using statistical methods. By integrating these methods with flexible machine learning applications, the proposed framework improves detection accuracy and reduces false positives and misclassifications.
arXiv Detail & Related papers (2024-12-11T22:12:21Z) - Controllable RANSAC-based Anomaly Detection via Hypothesis Testing [7.10052009802944]
We propose a novel statistical method for testing the anomaly detection results obtained by RANSAC (controllable RANSAC).
The key strength of the proposed method lies in its ability to control the probability of misidentifying anomalies below a pre-specified level.
Experiments conducted on synthetic and real-world datasets robustly support our theoretical results.
arXiv Detail & Related papers (2024-10-19T15:15:41Z) - FedAD-Bench: A Unified Benchmark for Federated Unsupervised Anomaly Detection in Tabular Data [11.42231457116486]
FedAD-Bench is a benchmark for evaluating unsupervised anomaly detection algorithms within the context of federated learning.
We identify key challenges such as model aggregation inefficiencies and metric unreliability.
Our work aims to establish a standardized benchmark to guide future research and development in federated anomaly detection.
arXiv Detail & Related papers (2024-08-08T13:14:19Z) - GCC: Generative Calibration Clustering [55.44944397168619]
We propose a novel Generative Calibration Clustering (GCC) method to incorporate feature learning and augmentation into the clustering procedure.
First, we develop a discriminative feature alignment mechanism to discover the intrinsic relationship across real and generated samples.
Second, we design a self-supervised metric learning scheme to generate more reliable cluster assignments.
arXiv Detail & Related papers (2024-04-14T01:51:11Z) - On the Universal Adversarial Perturbations for Efficient Data-free Adversarial Detection [55.73320979733527]
We propose a data-agnostic adversarial detection framework, which induces different responses between normal and adversarial samples to UAPs.
Experimental results show that our method achieves competitive detection performance on various text classification tasks.
arXiv Detail & Related papers (2023-06-27T02:54:07Z) - Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection [37.99031842449251]
Video anomaly detection under weak supervision presents significant challenges.
We present a weakly supervised anomaly detection framework that focuses on efficient context modeling and enhanced semantic discriminability.
Our approach significantly improves the detection accuracy of certain anomaly sub-classes, underscoring its practical value and efficacy.
arXiv Detail & Related papers (2023-06-26T06:45:16Z) - Parametric Classification for Generalized Category Discovery: A Baseline Study [70.73212959385387]
Generalized Category Discovery (GCD) aims to discover novel categories in unlabelled datasets using knowledge learned from labelled samples.
We investigate the failure of parametric classifiers, verify the effectiveness of previous design choices when high-quality supervision is available, and identify unreliable pseudo-labels as a key problem.
We propose a simple yet effective parametric classification method that benefits from entropy regularisation, achieves state-of-the-art performance on multiple GCD benchmarks and shows strong robustness to unknown class numbers.
arXiv Detail & Related papers (2022-11-21T18:47:11Z) - Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control [67.52000805944924]
Learn then Test (LTT) is a framework for calibrating machine learning models.
Our main insight is to reframe the risk-control problem as multiple hypothesis testing.
We use our framework to provide new calibration methods for several core machine learning tasks with detailed worked examples in computer vision.
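The "risk control as multiple hypothesis testing" insight can be sketched concretely: for each candidate threshold, test the null hypothesis "true risk exceeds the target" with a binomial tail p-value on held-out calibration losses, and keep only thresholds whose null is rejected. Everything below (the uniform toy losses, the candidate grid, the fixed-sequence stopping rule) is an illustrative simplification, not the paper's exact procedure.

```python
import numpy as np
from scipy import stats

# Toy calibration set: per-example 0/1 losses induced by a score threshold.
rng = np.random.default_rng(2)
scores = rng.uniform(0.0, 1.0, size=500)  # stand-in for calibration scores
target_risk, delta = 0.10, 0.05           # risk target and error budget

# Fixed-sequence testing: walk thresholds from conservative (high lam,
# low risk) to lax, keeping each lam whose null H0: risk(lam) > target
# is rejected, and stopping at the first non-rejection.
valid = []
for lam in np.linspace(0.99, 0.01, 50):
    k = int((scores > lam).sum())          # observed failures
    n = len(scores)
    # Binomial tail p-value for H0: true risk > target_risk.
    p = stats.binom.cdf(k, n, target_risk)
    if p <= delta:
        valid.append(lam)
    else:
        break

chosen = valid[-1] if valid else None      # least conservative valid lam
print("calibrated threshold:", chosen)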
arXiv Detail & Related papers (2021-10-03T17:42:03Z) - Efficient Intrusion Detection Using Evidence Theory [0.0]
Intrusion Detection Systems (IDS) are now an essential element when it comes to securing computers and networks.
This paper proposes a novel contextual discounting method based on sources' reliability and their ability to distinguish between normal and abnormal behavior.
arXiv Detail & Related papers (2021-03-15T17:54:16Z) - Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.