Binary Quantification and Dataset Shift: An Experimental Investigation
- URL: http://arxiv.org/abs/2310.04565v1
- Date: Fri, 6 Oct 2023 20:11:27 GMT
- Title: Binary Quantification and Dataset Shift: An Experimental Investigation
- Authors: Pablo González, Alejandro Moreo, and Fabrizio Sebastiani
- Abstract summary: Quantification is the supervised learning task that consists of training predictors of the class prevalence values of sets of unlabelled data.
The relationship between quantification and other types of dataset shift remains, by and large, unexplored.
We propose a fine-grained taxonomy of types of dataset shift and establish protocols for the generation of datasets affected by these types of shift.
- Score: 54.14283123210872
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Quantification is the supervised learning task that consists of training
predictors of the class prevalence values of sets of unlabelled data, and is of
special interest when the labelled data on which the predictor has been trained
and the unlabelled data are not IID, i.e., suffer from dataset shift. To date,
quantification methods have mostly been tested only on a special case of
dataset shift, i.e., prior probability shift; the relationship between
quantification and other types of dataset shift remains, by and large,
unexplored. In this work we carry out an experimental analysis of how current
quantification algorithms behave under different types of dataset shift, in
order to identify limitations of current approaches and hopefully pave the way
for the development of more broadly applicable methods. We do this by proposing
a fine-grained taxonomy of types of dataset shift, by establishing protocols
for the generation of datasets affected by these types of shift, and by testing
existing quantification methods on the datasets thus generated. One finding
that results from this investigation is that many existing quantification
methods that had been found robust to prior probability shift are not
necessarily robust to other types of dataset shift. A second finding is that no
existing quantification method seems robust enough to deal with all
the types of dataset shift we simulate in our experiments. The code needed to
reproduce all our experiments is publicly available at
https://github.com/pglez82/quant_datasetshift.
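For concreteness, the following is a minimal, self-contained sketch of the binary quantification setting under prior probability shift; it is not taken from the linked repository, and all function and variable names are illustrative. A toy sampling protocol draws test sets at controlled prevalences, and the naive Classify & Count (CC) estimate is compared with the standard Adjusted Classify & Count (ACC) correction.
```python
# Minimal sketch (not the authors' code) of binary quantification under
# prior probability shift: Classify & Count (CC) averages hard predictions,
# Adjusted Classify & Count (ACC) corrects CC using tpr/fpr estimated on
# held-out labelled data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def sample_at_prevalence(X, y, prevalence, size, rng):
    """Toy protocol: draw a test sample whose positive prevalence is `prevalence`."""
    n_pos = int(round(size * prevalence))
    pos = rng.choice(np.where(y == 1)[0], n_pos, replace=True)
    neg = rng.choice(np.where(y == 0)[0], size - n_pos, replace=True)
    idx = np.concatenate([pos, neg])
    return X[idx], y[idx]

def cc(clf, X):
    """Classify & Count: estimated prevalence = fraction of positive predictions."""
    return clf.predict(X).mean()

def acc(clf, X, X_val, y_val):
    """Adjusted Classify & Count: invert the classifier's tpr/fpr to correct CC."""
    tpr = clf.predict(X_val[y_val == 1]).mean()
    fpr = clf.predict(X_val[y_val == 0]).mean()
    raw = (cc(clf, X) - fpr) / max(tpr - fpr, 1e-9)
    return float(np.clip(raw, 0.0, 1.0))

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=5000, random_state=0)   # roughly balanced training pool
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

for true_prev in (0.1, 0.5, 0.9):  # test prevalences far from the ~0.5 training prevalence
    X_te, _ = sample_at_prevalence(X, y, true_prev, 1000, rng)
    print(f"true={true_prev:.2f}  CC={cc(clf, X_te):.2f}  ACC={acc(clf, X_te, X_val, y_val):.2f}")
```
Under prior probability shift, ACC generally tracks the shifted prevalence far better than raw CC; the question the paper raises is whether such corrections remain reliable under the other types of shift it simulates.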
Related papers
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z) - Adversarial Learning for Feature Shift Detection and Correction [45.65548560695731]
Feature shifts can occur in many datasets, including in multi-sensor data, where some sensors are malfunctioning, or in structured data, where faulty standardization and data processing pipelines can lead to erroneous features.
This work explores the principles of adversarial learning: information from several discriminators trained to distinguish between two distributions is used both to detect the corrupted features and to fix them, removing the distribution shift between the datasets.
arXiv Detail & Related papers (2023-12-07T18:58:40Z) - Time-Varying Propensity Score to Bridge the Gap between the Past and Present [104.46387765330142]
We introduce a time-varying propensity score that can detect gradual shifts in the distribution of data.
We demonstrate different ways of implementing it and evaluate it on a variety of problems.
arXiv Detail & Related papers (2022-10-04T07:21:49Z) - A unified framework for dataset shift diagnostics [2.449909275410288]
Supervised learning techniques typically assume training data originates from the target population.
Yet dataset shift frequently arises and, if not adequately taken into account, may decrease the performance of the resulting predictors.
We propose a novel and flexible framework called DetectShift that quantifies and tests for multiple dataset shifts.
arXiv Detail & Related papers (2022-05-17T13:34:45Z) - Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks [44.61070965407907]
Given the current state of the field, a standardized large-scale dataset of tasks across a range of modalities affected by distributional shifts is necessary.
We propose the Shifts dataset for evaluation of uncertainty estimates and robustness to distributional shift.
arXiv Detail & Related papers (2021-07-15T16:59:34Z) - Ensembling Shift Detectors: an Extensive Empirical Evaluation [0.2538209532048867]
The term dataset shift refers to the situation where the data used to train a machine learning model differs from the data the model operates on.
We propose a simple yet powerful technique to ensemble complementary shift detectors, while tuning the significance level of each detector's statistical test to the dataset.
arXiv Detail & Related papers (2021-06-28T12:21:16Z) - Multi-Source Causal Inference Using Control Variates [81.57072928775509]
We propose a general algorithm to estimate causal effects from multiple data sources.
We show theoretically that this reduces the variance of the ATE estimate.
We apply this framework to inference from observational data under an outcome selection bias.
arXiv Detail & Related papers (2021-03-30T21:20:51Z) - Robust Classification under Class-Dependent Domain Shift [29.54336432319199]
In this paper we explore a special type of dataset shift which we call class-dependent domain shift.
It is characterized by the following features: the input data causally depends on the label; the shift in the data is fully explained by a known variable; the variable that controls the shift can depend on the label; and there is no shift in the label distribution.
arXiv Detail & Related papers (2020-07-10T12:26:57Z) - Evaluating Prediction-Time Batch Normalization for Robustness under Covariate Shift [81.74795324629712]
We evaluate prediction-time batch normalization, which significantly improves model accuracy and calibration under covariate shift.
We show that prediction-time batch normalization provides complementary benefits to existing state-of-the-art approaches for improving robustness.
The method has mixed results when used alongside pre-training, and does not seem to perform as well under more natural types of dataset shift.
arXiv Detail & Related papers (2020-06-19T05:08:43Z)
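As a side note on the last entry, prediction-time batch normalization can be sketched in a few lines of PyTorch; this is an illustrative approximation rather than the cited paper's code, and `model` and `test_loader` are placeholder names.
```python
# Illustrative sketch of prediction-time batch normalization (not the cited
# paper's code): keep the model in eval mode, but let the BatchNorm layers
# normalise with the statistics of the current test batch instead of the
# running averages accumulated on the training distribution.
import torch
import torch.nn as nn

def use_prediction_time_bn(model: nn.Module) -> nn.Module:
    model.eval()  # dropout etc. keep their inference behaviour
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            # Training mode makes BN use batch statistics; note that this also
            # updates the running buffers as a side effect of each forward pass.
            m.train()
    return model

# Hypothetical usage:
# use_prediction_time_bn(model)
# with torch.no_grad():
#     for xb, _ in test_loader:
#         probs = torch.softmax(model(xb), dim=-1)
```
Switching only the normalization layers back to batch statistics is the point: the running averages encode the training input distribution, which is precisely what covariate shift invalidates.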
This list is automatically generated from the titles and abstracts of the papers listed on this site.