Toward Unsupervised Outlier Model Selection
- URL: http://arxiv.org/abs/2211.01834v1
- Date: Thu, 3 Nov 2022 14:14:46 GMT
- Authors: Yue Zhao, Sean Zhang, Leman Akoglu
- Abstract summary: ELECT is a new approach to select an effective model on a new dataset without any labels.
It is based on meta-learning: it transfers prior knowledge (e.g. model performance) from historical datasets that are similar to the new one.
It can serve an output on demand, accommodating varying time budgets.
- Score: 20.12322454417006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Today there exists no shortage of outlier detection algorithms in the
literature, yet the complementary and critical problem of unsupervised outlier
model selection (UOMS) is vastly understudied. In this work we propose ELECT, a
new approach to select an effective candidate model, i.e. an outlier detection
algorithm and its hyperparameter(s), to employ on a new dataset without any
labels. At its core, ELECT is based on meta-learning; transferring prior
knowledge (e.g. model performance) on historical datasets that are similar to
the new one to facilitate UOMS. Uniquely, it employs a dataset similarity
measure that is performance-based, which is more direct and goal-driven than
other measures used in the past. ELECT adaptively searches for similar
historical datasets and, as such, can serve an output on demand,
accommodating varying time budgets. Extensive experiments show that ELECT
significantly outperforms a wide range of basic UOMS baselines, including no
model selection (always using the same popular model such as iForest) as well
as more recent selection strategies based on meta-features.
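The abstract describes selecting a model for an unlabeled dataset by finding historical datasets whose model-performance profiles correlate with the new one. A minimal sketch of that idea follows; it is an illustration under assumptions, not ELECT's actual algorithm. The function name, the use of Pearson correlation as the similarity measure, and the `perf_new_partial` proxy scores (which ELECT obtains without labels via internal consensus) are all hypothetical simplifications.

```python
import numpy as np

def select_model(perf_hist, perf_new_partial, evaluated_idx):
    """Pick a model for a new dataset via performance-based similarity.

    perf_hist:        (n_datasets, n_models) performance of candidate
                      models on historical (labeled) datasets.
    perf_new_partial: proxy scores of the models in `evaluated_idx`
                      on the new, unlabeled dataset.
    evaluated_idx:    indices of models already evaluated on the new data.
    """
    sims = []
    for row in perf_hist:
        # Similarity = correlation of model performances on the shared
        # subset of evaluated models -- more direct and goal-driven
        # than comparing generic meta-features.
        sims.append(np.corrcoef(row[evaluated_idx], perf_new_partial)[0, 1])
    nearest = int(np.argmax(sims))
    # Recommend the best-performing model on the most similar dataset.
    return int(np.argmax(perf_hist[nearest]))
```

In this sketch, evaluating more models on the new dataset refines the similarity estimate, which mirrors the abstract's point about adaptively accommodating varying time budgets.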
Related papers
- LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science.
Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
arXiv Detail & Related papers (2024-07-02T22:23:40Z)
- Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models [38.39395973523944]
We propose a three-stage scheme for data selection and review existing works according to this scheme.
We find that the more targeted method with data-specific and model-specific quality labels has higher efficiency.
arXiv Detail & Related papers (2024-06-20T08:58:58Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data with the reasoning skills necessary for the intended downstream application.
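The summary above describes ranking training examples by how closely their gradients align with a target-task gradient. A minimal sketch of that scoring step follows, under assumptions: the function name is hypothetical, gradients are stood in by plain vectors, and LESS's low-rank projection and LoRA-based gradient features are omitted.

```python
import numpy as np

def select_top_k(train_grads, target_grad, k):
    """Rank training examples by cosine similarity between their
    gradient vectors and the target-task gradient, then keep the
    top-k, in the spirit of gradient-similarity data selection."""
    tg = target_grad / np.linalg.norm(target_grad)
    norms = np.linalg.norm(train_grads, axis=1)
    # Cosine similarity of each example's gradient with the target's.
    scores = train_grads @ tg / norms
    return np.argsort(scores)[::-1][:k]
```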
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
We introduce a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective.
The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets.
Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
arXiv Detail & Related papers (2023-11-14T14:10:40Z)
- Towards Free Data Selection with General-Purpose Models [71.92151210413374]
A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets.
Current approaches, represented by active learning methods, typically follow a cumbersome pipeline that iterates the time-consuming model training and batch data selection repeatedly.
FreeSel bypasses the heavy batch selection process, achieving a significant improvement in efficiency and being 530x faster than existing active learning methods.
arXiv Detail & Related papers (2023-09-29T15:50:14Z)
- Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We make empirical studies of state-of-the-art UniDA methods using foundation models.
We introduce CLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models.
Although simple, our method outperforms previous approaches in most benchmark tasks.
arXiv Detail & Related papers (2023-05-18T16:28:29Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- Automating Outlier Detection via Meta-Learning [37.736124230543865]
We develop the first principled data-driven approach to model selection for outlier detection, called MetaOD, based on meta-learning.
We show the effectiveness of MetaOD in selecting a detection model that significantly outperforms the most popular outlier detectors.
To foster further research on this new problem, we open-source our entire meta-learning system, benchmark environment, and testbed datasets.
arXiv Detail & Related papers (2020-09-22T15:14:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.