Zero-shot Outlier Detection via Prior-data Fitted Networks: Model Selection Bygone!
- URL: http://arxiv.org/abs/2409.05672v2
- Date: Thu, 06 Feb 2025 19:40:04 GMT
- Title: Zero-shot Outlier Detection via Prior-data Fitted Networks: Model Selection Bygone!
- Authors: Yuchen Shen, Haomin Wen, Leman Akoglu
- Abstract summary: FoMo-0D is a pre-trained Foundation Model for zero/0-shot OD on tabular data.
It can directly predict the (outlier/inlier) label of test samples without parameter fine-tuning.
Experiments on 57 real-world datasets show that FoMo-0D significantly outperforms the vast majority of the baselines.
- Score: 28.823740273813296
- License:
- Abstract: Outlier detection (OD) has a vast literature as it finds numerous real-world applications. Being an inherently unsupervised task, model selection is a key bottleneck for OD without label supervision. Although many OD techniques are available to choose from, algorithm and hyperparameter selection remain challenging for OD, limiting its effective use in practice. In this paper, we present FoMo-0D, a pre-trained Foundation Model for zero/0-shot OD on tabular data, which bypasses the hurdle of model selection. To overcome the difficulty of labeled data collection, FoMo-0D is trained on synthetic data and can directly predict the (outlier/inlier) label of test samples without parameter fine-tuning, obviating the need to choose an algorithm/architecture and tune its associated hyperparameters when given a new OD dataset. Extensive experiments on 57 real-world datasets against 26 baselines show that FoMo-0D significantly outperforms the vast majority of the baselines and is statistically no different from the 2nd best method, with an average inference time of 7.7 ms per sample, offering at least a 7x speed-up compared to previous methods. To facilitate future research, our implementations and checkpoints are openly available at https://anonymous.4open.science/r/PFN40D.
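The zero-shot interface the abstract describes is simple: condition on the new dataset as context and score test rows in a single forward pass, with nothing fitted or tuned. Below is a minimal Python sketch of that interface; the function name is hypothetical, and a k-nearest-neighbor distance stands in for the pretrained transformer so the example runs standalone.
```python
# Minimal sketch of the zero-shot OD interface: no algorithm choice, no
# hyperparameter tuning, no fitting on the target dataset. A kNN distance
# replaces the pretrained transformer here; FoMo-0D's actual forward pass is
# a single in-context inference over the (unlabeled) context set.
import numpy as np

def zero_shot_outlier_scores(context: np.ndarray, queries: np.ndarray,
                             k: int = 5) -> np.ndarray:
    """Score each query row against an unlabeled context set (higher = more outlying)."""
    # Stand-in for the model's forward pass: distance to the k-th nearest
    # context point. A PFN-style model would instead attend over `context`
    # and emit calibrated outlier probabilities in one pass.
    dists = np.linalg.norm(queries[:, None, :] - context[None, :, :], axis=-1)
    return np.sort(dists, axis=1)[:, k - 1]

rng = np.random.default_rng(0)
context = rng.normal(size=(500, 8))            # inliers from the new dataset
queries = np.vstack([rng.normal(size=(5, 8)),  # likely inliers
                     rng.normal(5.0, 1.0, size=(5, 8))])  # likely outliers
print(zero_shot_outlier_scores(context, queries))
```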
Related papers
- Unsupervised Anomaly Detection for Tabular Data Using Noise Evaluation [26.312206159418903]
Unsupervised anomaly detection (UAD) plays an important role in modern data analytics.
We present a novel UAD method by evaluating how much noise is in the data.
We provide theoretical guarantees, proving that the proposed method can detect anomalous data successfully.
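The summary does not spell out the scoring rule, so the sketch below shows one generic way to turn "noise evaluation" into an anomaly score. This noise-contrastive construction is an assumption for illustration, not necessarily the paper's method.
```python
# Hedged sketch: noise-contrastive anomaly scoring. Train a classifier to
# separate the data from a noise-perturbed copy, then treat "noise-likeness"
# as the anomaly score. An illustrative guess, not the paper's exact procedure.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                  # mostly normal data
X[:20] += 6.0                                    # a few injected anomalies
noisy = X + rng.normal(scale=3.0, size=X.shape)  # noise-perturbed copy

clf = RandomForestClassifier(random_state=0)
clf.fit(np.vstack([X, noisy]), np.r_[np.zeros(len(X)), np.ones(len(noisy))])
scores = clf.predict_proba(X)[:, 1]              # P(noise-like) = anomaly score
print(np.argsort(scores)[-5:])                   # highest-scoring rows
```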
arXiv Detail & Related papers (2024-12-16T05:35:58Z)
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
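A minimal sketch of such an option-content probe appears below; `query_model` is a hypothetical black-box callable (prompt in, text out), and the similarity scoring is an illustrative stand-in for the paper's actual test.
```python
# Hedged sketch of a black-box leakage probe: ask the model to regenerate a
# hidden multiple-choice option and measure overlap with the true one. High
# average overlap hints the benchmark item was memorized during training.
from difflib import SequenceMatcher

def leakage_score(question: str, options: list[str], query_model) -> float:
    """Average similarity between true options and model-regenerated ones."""
    sims = []
    for i, hidden in enumerate(options):
        shown = [o for j, o in enumerate(options) if j != i]
        prompt = (f"{question}\nKnown options: {shown}\n"
                  f"Write the one missing option verbatim:")
        guess = query_model(prompt)
        sims.append(SequenceMatcher(None, hidden.lower(), guess.lower()).ratio())
    return sum(sims) / len(sims)  # near 1.0 suggests memorization/leakage

# Demo with a toy "model" that has memorized the benchmark item:
memorized = lambda prompt: "the mitochondria"
print(leakage_score("What powers the cell?",
                    ["the mitochondria", "the ribosome", "the nucleus"],
                    memorized))
```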
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
- Towards Free Data Selection with General-Purpose Models [71.92151210413374]
A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets.
Current approaches, represented by active learning methods, typically follow a cumbersome pipeline that repeatedly alternates between time-consuming model training and batch data selection.
FreeSel bypasses this heavy batch selection process, achieving a significant improvement in efficiency and running 530x faster than existing active learning methods.
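A minimal sketch of training-free selection in this spirit: embed the pool once with a frozen general-purpose model, then pick a diverse subset by farthest-point sampling. The random features and the selection rule here are simplifying assumptions, not FreeSel's exact criterion.
```python
# Sketch of training-free data selection: one embedding pass, no retraining
# loop, then a diversity-maximizing pick under a fixed annotation budget.
import numpy as np

def farthest_point_selection(feats: np.ndarray, budget: int) -> list[int]:
    chosen = [0]                                   # seed with an arbitrary point
    d = np.linalg.norm(feats - feats[0], axis=1)   # distance to selected set
    for _ in range(budget - 1):
        nxt = int(np.argmax(d))                    # most novel remaining sample
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(feats - feats[nxt], axis=1))
    return chosen

pool = np.random.default_rng(0).normal(size=(10_000, 64))  # stand-in features
print(farthest_point_selection(pool, budget=10))
```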
arXiv Detail & Related papers (2023-09-29T15:50:14Z)
- Temporal Output Discrepancy for Loss Estimation-based Active Learning [65.93767110342502]
We present a novel deep active learning approach that queries the oracle for data annotation when an unlabeled sample is believed to incur a high loss.
Our approach outperforms state-of-the-art active learning methods on image classification and semantic segmentation tasks.
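A sketch of the acquisition rule as described: score each unlabeled sample by how much the model's output changes between two training checkpoints, and send the top scorers to the oracle. The toy linear "checkpoints" below are stand-ins for saved model states.
```python
# Sketch of a loss-estimation acquisition score in the spirit of temporal
# output discrepancy: samples whose outputs move most between checkpoints
# are treated as high-loss and queried for labels.
import numpy as np

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(1000, 16))
W_t  = rng.normal(size=(16, 4))              # model weights at step t
W_t2 = W_t + 0.1 * rng.normal(size=(16, 4))  # weights a few steps later

# Discrepancy ||f_{t+dt}(x) - f_t(x)|| per sample, used as the query score.
scores = np.linalg.norm(X_unlabeled @ W_t2 - X_unlabeled @ W_t, axis=1)
query_idx = np.argsort(scores)[-32:]         # top-scoring batch goes to the oracle
print(query_idx[:10])
```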
arXiv Detail & Related papers (2022-12-20T19:29:37Z)
- Unsupervised Model Selection for Time-series Anomaly Detection [7.8027110514393785]
We identify three classes of surrogate (unsupervised) metrics, namely, prediction error, model centrality, and performance on injected synthetic anomalies.
We formulate metric combination with multiple imperfect surrogate metrics as a robust rank aggregation problem.
Large-scale experiments on multiple real-world datasets demonstrate that our proposed unsupervised approach is as effective as selecting the most accurate model.
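A minimal sketch of the aggregation step: each surrogate metric ranks the candidate detectors, and a consensus ranking picks the model. Plain Borda count is used below as the simplest stand-in for the robust rank aggregation the paper formulates; all scores are synthetic.
```python
# Sketch of combining imperfect surrogate metrics by rank aggregation.
import numpy as np

# rows = surrogate metrics, columns = candidate models (higher = better)
metric_scores = np.array([
    [0.61, 0.72, 0.55, 0.68],   # e.g. prediction error (inverted)
    [0.40, 0.80, 0.35, 0.75],   # e.g. model centrality
    [0.58, 0.72, 0.66, 0.71],   # e.g. performance on injected anomalies
])
ranks = metric_scores.argsort(axis=1).argsort(axis=1)  # 0 = worst per metric
borda = ranks.sum(axis=0)                              # consensus rank total
print("selected model:", int(borda.argmax()))
```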
arXiv Detail & Related papers (2022-10-03T16:49:30Z)
- Information FOMO: The unhealthy fear of missing out on information. A method for removing misleading data for healthier models [0.0]
Misleading or unnecessary data can have outsized impacts on the health or accuracy of Machine Learning (ML) models.
We present a sequential selection method that identifies critically important information within a dataset.
We find these instabilities are a result of the complexity of the underlying map and are linked to extreme events and heavy tails.
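A brute-force sketch of sequential removal of misleading data: greedily drop the training point whose removal most improves held-out error. This leave-one-out loop only illustrates the idea; the paper's criterion is assumed to be more scalable.
```python
# Illustrative greedy removal of misleading points (leave-one-out search).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(60, 3)); y = X @ w
y[:3] += 25.0                                # three corrupted, misleading labels
Xv = rng.normal(size=(200, 3)); yv = Xv @ w  # clean validation set

def val_mse(idx):
    m = LinearRegression().fit(X[idx], y[idx])
    return ((m.predict(Xv) - yv) ** 2).mean()

keep = list(range(len(X)))
for _ in range(3):                           # remove the 3 worst offenders
    errs = [val_mse([j for j in keep if j != i]) for i in keep]
    del keep[int(np.argmin(errs))]
print(sorted(set(range(60)) - set(keep)))    # ideally the corrupted rows [0, 1, 2]
```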
arXiv Detail & Related papers (2022-08-27T19:43:53Z)
- Efficient Testing of Deep Neural Networks via Decision Boundary Analysis [28.868479656437145]
We propose a novel technique, named Aries, that can estimate the performance of DNNs on new unlabeled data.
The accuracy estimated by Aries is only 0.03% to 2.60% (on average 0.61%) off the true accuracy.
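A sketch of label-free accuracy estimation in this style: bucket samples by how close they sit to the decision boundary, take each bucket's accuracy from an already-labeled set, and reweight by the new data's bucket occupancy. The softmax-margin proxy below is an assumption; Aries itself derives buckets from dropout-based disagreement.
```python
# Sketch: estimate accuracy on unlabeled data from decision-boundary proximity.
import numpy as np

def estimate_accuracy(margin_old, correct_old, margin_new, bins=10):
    edges = np.linspace(0.0, 1.0, bins + 1)
    b_old = np.clip(np.digitize(margin_old, edges) - 1, 0, bins - 1)
    b_new = np.clip(np.digitize(margin_new, edges) - 1, 0, bins - 1)
    est = 0.0
    for b in range(bins):
        mass = (b_new == b).mean()           # bucket share in the new data
        if mass and (b_old == b).any():
            est += mass * correct_old[b_old == b].mean()  # bucket accuracy
    return est

rng = np.random.default_rng(0)
margin_old = rng.uniform(size=5000)                 # proxy: softmax margin
correct_old = rng.uniform(size=5000) < margin_old   # high margin -> usually right
margin_new = rng.uniform(size=2000) ** 2            # a harder, shifted dataset
print(estimate_accuracy(margin_old, correct_old, margin_new))
```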
arXiv Detail & Related papers (2022-07-22T08:39:10Z)
- Self-Trained One-class Classification for Unsupervised Anomaly Detection [56.35424872736276]
Anomaly detection (AD) has various applications across domains, from manufacturing to healthcare.
In this work, we focus on unsupervised AD problems whose entire training data are unlabeled and may contain both normal and anomalous samples.
To tackle this problem, we build a robust one-class classification framework via data refinement.
We show that our method outperforms the state-of-the-art one-class classification method by 6.3 points in AUC and 12.5 points in average precision.
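A minimal sketch of the refinement loop: fit a one-class model on all unlabeled data, discard the most anomalous-looking fraction, and refit. IsolationForest stands in for the paper's one-class deep model; the loop structure is the point.
```python
# Sketch of one-class learning with iterative data refinement.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(950, 8)),             # unlabeled: mostly normal...
               rng.normal(4.0, 1.0, size=(50, 8))])   # ...plus hidden anomalies

kept = X
for _ in range(3):                                    # refinement rounds
    model = IsolationForest(random_state=0).fit(kept)
    scores = model.score_samples(kept)                # lower = more anomalous
    kept = kept[scores > np.quantile(scores, 0.05)]   # drop suspected anomalies

final = IsolationForest(random_state=0).fit(kept)     # one-class model on refined data
print(len(kept), "of", len(X), "samples kept after refinement")
```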
arXiv Detail & Related papers (2021-06-11T01:36:08Z)
- Automating Outlier Detection via Meta-Learning [37.736124230543865]
We develop the first principled data-driven approach to model selection for outlier detection, called MetaOD, based on meta-learning.
We show the effectiveness of MetaOD in selecting a detection model that significantly outperforms the most popular outlier detectors.
To foster and further research on this new problem, we open-source our entire meta-learning system, benchmark environment, and testbed datasets.
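The selection step can be pictured with a simple stand-in: describe each historical dataset by meta-features, record how each candidate detector performed there, and recommend for a new dataset whatever worked on its nearest neighbors. MetaOD itself learns this mapping (collaborative-filtering style) rather than using raw kNN, and all numbers below are synthetic.
```python
# Sketch of meta-learned model selection for OD via nearest historical datasets.
import numpy as np

rng = np.random.default_rng(0)
meta_feats = rng.normal(size=(100, 12))     # meta-features of 100 past datasets
perf = rng.uniform(size=(100, 8))           # AUC of 8 detectors per past dataset

def recommend(new_meta: np.ndarray, k: int = 5) -> int:
    d = np.linalg.norm(meta_feats - new_meta, axis=1)
    nn = np.argsort(d)[:k]                  # most similar historical datasets
    return int(perf[nn].mean(axis=0).argmax())  # detector with best average AUC

print("recommended detector:", recommend(rng.normal(size=12)))
```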
arXiv Detail & Related papers (2020-09-22T15:14:45Z)
- Contextual-Bandit Anomaly Detection for IoT Data in Distributed Hierarchical Edge Computing [65.78881372074983]
IoT devices can hardly afford complex deep neural network (DNN) models, and offloading anomaly detection tasks to the cloud incurs long delays.
We propose and build a demo for an adaptive anomaly detection approach for distributed hierarchical edge computing (HEC) systems.
We show that our proposed approach significantly reduces detection delay without sacrificing accuracy, as compared to offloading detection tasks to the cloud.
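The adaptive scheme can be sketched as a contextual bandit over model placements: per input, choose the tiny edge model or the heavy cloud model, with reward trading detection quality against delay. The epsilon-greedy policy and reward numbers below are illustrative assumptions, not the paper's exact design.
```python
# Sketch of bandit-driven model placement across a hierarchical edge system.
import numpy as np

rng = np.random.default_rng(0)
value = np.zeros((2, 2)); pulls = np.zeros((2, 2))  # [context bucket, arm]

def reward(arm: int, hardness: float) -> float:
    acc = 0.95 - (0.8 * hardness if arm == 0 else 0.0)  # edge degrades on hard inputs
    delay = 0.05 if arm == 0 else 0.50                  # offloading to cloud is slow
    return acc - delay

for t in range(5000):
    hardness = rng.uniform()                 # context observed before acting
    b = int(hardness > 0.5)                  # coarse context bucket
    arm = rng.integers(2) if rng.uniform() < 0.1 else int(value[b].argmax())
    r = reward(arm, hardness)
    pulls[b, arm] += 1
    value[b, arm] += (r - value[b, arm]) / pulls[b, arm]  # incremental mean

# Expected outcome: edge (arm 0) wins on easy inputs, cloud (arm 1) on hard ones.
print(value.round(3))
```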
arXiv Detail & Related papers (2020-04-15T06:13:33Z)
- SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection [63.253850875265115]
Outlier detection (OD) is a key machine learning (ML) task for identifying abnormal objects from general samples.
We propose a modular acceleration system, called SUOD, to speed up the training and prediction of large numbers of heterogeneous unsupervised detectors.
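At its simplest, the acceleration recipe can be sketched as: project the data to a lower dimension once, then fit a heterogeneous pool of detectors in parallel. The pool below is an arbitrary stand-in, and the unweighted score average is naive; SUOD adds pseudo-supervised approximation and balanced scheduling on top.
```python
# Sketch: random projection once, then parallel fits of heterogeneous detectors.
import numpy as np
from joblib import Parallel, delayed
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.random_projection import GaussianRandomProjection
from sklearn.svm import OneClassSVM

X = np.random.default_rng(0).normal(size=(2000, 100))
X_small = GaussianRandomProjection(n_components=20, random_state=0).fit_transform(X)
fit_part, score_part = X_small[:1000], X_small[1000:]

detectors = [IsolationForest(random_state=0),
             LocalOutlierFactor(novelty=True),
             OneClassSVM(nu=0.05)]
fitted = Parallel(n_jobs=-1)(delayed(d.fit)(fit_part) for d in detectors)

# Naive unweighted average of per-detector scores (lower = more outlying);
# real ensembles standardize scores before combining.
scores = np.mean([m.score_samples(score_part) for m in fitted], axis=0)
print(scores[:5])
```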
arXiv Detail & Related papers (2020-03-11T00:22:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.