Data-SUITE: Data-centric identification of in-distribution incongruous examples
- URL: http://arxiv.org/abs/2202.08836v2
- Date: Fri, 18 Feb 2022 16:31:51 GMT
- Title: Data-SUITE: Data-centric identification of in-distribution incongruous examples
- Authors: Nabeel Seedat, Jonathan Crabbé, Mihaela van der Schaar
- Abstract summary: Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
- Score: 81.21462458089142
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Systematic quantification of data quality is critical for consistent model
performance. Prior works have focused on out-of-distribution data. Instead, we
tackle an understudied yet equally important problem of characterizing
incongruous regions of in-distribution (ID) data, which may arise from feature
space heterogeneity. To this end, we propose a paradigm shift with Data-SUITE:
a data-centric framework to identify these regions, independent of a
task-specific model. Data-SUITE leverages copula modeling, representation
learning, and conformal prediction to build feature-wise confidence interval
estimators based on a set of training instances. These estimators can be used
to evaluate the congruence of test instances with respect to the training set,
to answer two practically useful questions: (1) which test instances will be
reliably predicted by a model trained with the training instances? and (2) can
we identify incongruous regions of the feature space so that data owners
understand the data's limitations or guide future data collection? We
empirically validate Data-SUITE's performance and coverage guarantees and
demonstrate on cross-site medical data, biased data, and data with concept
drift, that Data-SUITE best identifies ID regions where a downstream model may
be reliable (independent of said model). We also illustrate how these
identified regions can provide insights into datasets and highlight their
limitations.
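To make the interval mechanism concrete, the following is a minimal sketch of feature-wise split-conformal intervals used to score the congruence of test instances. It deliberately omits the copula modeling and representation learning stages that Data-SUITE combines with conformal prediction, and the regressor, calibration split, and flagging rule are illustrative assumptions rather than the paper's implementation.
```python
# Minimal sketch: feature-wise split-conformal intervals for flagging
# incongruous test instances. Simplified relative to Data-SUITE (no copula
# modeling or representation learning); choices below are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def featurewise_incongruence(X_train, X_test, alpha=0.1):
    """Per test instance, the fraction of features falling outside their
    (1 - alpha) split-conformal interval, predicted from the other features."""
    n_features = X_train.shape[1]
    outside = np.zeros((X_test.shape[0], n_features), dtype=bool)

    for j in range(n_features):
        rest = np.delete(np.arange(n_features), j)
        # Split training data into a fitting part and a calibration part.
        X_fit, X_cal = train_test_split(X_train, test_size=0.3, random_state=j)
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(X_fit[:, rest], X_fit[:, j])

        # Nonconformity scores: absolute residuals on the calibration split.
        scores = np.abs(X_cal[:, j] - model.predict(X_cal[:, rest]))
        level = min(1.0, np.ceil((1 - alpha) * (len(scores) + 1)) / len(scores))
        q = np.quantile(scores, level)

        # A test feature is flagged if its value misses the calibrated interval.
        outside[:, j] = np.abs(X_test[:, j] - model.predict(X_test[:, rest])) > q

    return outside.mean(axis=1)
```
Test instances for which many features miss their calibrated intervals are candidates for the incongruous regions described above; instances whose features all fall inside their intervals are the ones a downstream model is more likely to predict reliably.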
Related papers
- Towards a Theoretical Understanding of Memorization in Diffusion Models [76.85077961718875]
Diffusion probabilistic models (DPMs) are being employed as mainstream models for Generative Artificial Intelligence (GenAI).
We provide a theoretical understanding of memorization in both conditional and unconditional DPMs under the assumption of model convergence.
We propose a novel data extraction method named Surrogate condItional Data Extraction (SIDE) that leverages a time-dependent classifier trained on the generated data as a surrogate condition to extract training data from unconditional DPMs.
arXiv Detail & Related papers (2024-10-03T13:17:06Z) - DAGnosis: Localized Identification of Data Inconsistencies using
Structures [73.39285449012255]
Identification and appropriate handling of inconsistencies in data at deployment time is crucial to reliably use machine learning models.
We use directed acyclic graphs (DAGs) to encode the probability distribution and independencies of the training set's features as a structure.
Our method, called DAGnosis, leverages these structural interactions to bring valuable and insightful data-centric conclusions.
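As a rough, hedged sketch of the underlying idea (not DAGnosis' actual conformal procedure), each feature can be checked against what its DAG parents predict; the DAG is assumed to be given, and the linear models and flagging threshold below are illustrative.
```python
# Rough sketch of DAG-based localized inconsistency checks (a simplification).
# `parents` is assumed to map each feature index to a list of parent indices.
import numpy as np
from sklearn.linear_model import LinearRegression

def dag_inconsistency_flags(X_train, X_test, parents, n_std=3.0):
    """Flag test cells whose values are far from what their DAG parents
    predict, measured in units of the training residual std."""
    flags = np.zeros_like(X_test, dtype=bool)
    for j, pa in parents.items():
        if not pa:
            continue  # root nodes: no conditional check in this sketch
        model = LinearRegression().fit(X_train[:, pa], X_train[:, j])
        resid_std = np.std(X_train[:, j] - model.predict(X_train[:, pa]))
        err = np.abs(X_test[:, j] - model.predict(X_test[:, pa]))
        flags[:, j] = err > n_std * resid_std
    return flags  # boolean matrix: which (instance, feature) cells look inconsistent
```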
arXiv Detail & Related papers (2024-02-26T11:29:16Z) - Example-Based Explainable AI and its Application for Remote Sensing
Image Classification [0.0]
We show, as an explanation, an instance from the training dataset that is similar to the input data to be inferred.
The concept was successfully demonstrated using a remote sensing image dataset from the Sentinel-2 satellite.
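A minimal sketch of the underlying retrieval step, under the assumption that instances are compared in some feature or embedding space with Euclidean distance (an illustrative choice, not necessarily the paper's):
```python
# Example-based explanation by retrieving the most similar training instances.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nearest_training_examples(train_embeddings, query_embedding, k=3):
    nn = NearestNeighbors(n_neighbors=k).fit(train_embeddings)
    dist, idx = nn.kneighbors(query_embedding.reshape(1, -1))
    return idx[0], dist[0]  # indices and distances of the k most similar training instances
```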
arXiv Detail & Related papers (2023-02-03T03:48:43Z) - Uncertainty in Contrastive Learning: On the Predictability of Downstream
Performance [7.411571833582691]
We study whether the uncertainty of such a representation can be quantified for a single datapoint in a meaningful way.
We show that this goal can be achieved by directly estimating the distribution of the training data in the embedding space.
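A minimal sketch of that recipe, assuming embeddings have already been computed; the Gaussian mixture below is an illustrative stand-in for the paper's density estimator:
```python
# Score per-datapoint uncertainty as (negative) log-density of the training
# embedding distribution, fitted here with a simple Gaussian mixture.
import numpy as np
from sklearn.mixture import GaussianMixture

def embedding_uncertainty(train_emb, test_emb, n_components=10):
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0).fit(train_emb)
    log_density = gmm.score_samples(test_emb)
    return -log_density  # higher value = lower density = more uncertain
```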
arXiv Detail & Related papers (2022-07-19T15:44:59Z) - Learning to be a Statistician: Learned Estimator for Number of Distinct
Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
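A hedged sketch of this supervised framing: simulate columns with known distinct counts, featurize a random sample of each, and fit a regressor. The data generator and the frequency-profile features below are assumptions for illustration, not the paper's design.
```python
# Learn an NDV estimator from (sample features -> true distinct count) pairs.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def sample_features(sample):
    vals, counts = np.unique(sample, return_counts=True)
    f1 = np.sum(counts == 1)          # values seen exactly once in the sample
    f2 = np.sum(counts == 2)          # values seen exactly twice in the sample
    return [len(sample), len(vals), f1, f2]

def train_ndv_estimator(n_columns=1000, column_size=5000, sample_size=500, seed=0):
    rng = np.random.default_rng(seed)
    X, y = [], []
    for _ in range(n_columns):
        domain = rng.integers(10, column_size)
        column = rng.integers(0, domain, size=column_size)      # synthetic column
        sample = rng.choice(column, size=sample_size, replace=False)
        X.append(sample_features(sample))
        y.append(np.log1p(len(np.unique(column))))              # actual NDV of the column
    model = RandomForestRegressor(n_estimators=200, random_state=seed)
    model.fit(np.array(X), np.array(y))
    return model  # predicts log(1 + NDV) from sample features
```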
arXiv Detail & Related papers (2022-02-06T15:42:04Z) - Managing dataset shift by adversarial validation for credit scoring [5.560471251954645]
A mismatch between the distribution of the training data and the data that actually needs to be predicted is likely to cause poor model performance.
We propose a method based on adversarial validation to alleviate the dataset shift problem in credit scoring scenarios.
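A minimal sketch of generic adversarial validation, the standard technique the summary refers to (the paper's credit-scoring-specific procedure may differ): train a classifier to distinguish training rows from test rows and inspect its AUC and per-row scores.
```python
# Generic adversarial validation: a classifier separates train from test rows.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def adversarial_validation(X_train, X_test):
    X = np.vstack([X_train, X_test])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    clf = GradientBoostingClassifier()
    # Out-of-fold probability that each row belongs to the test set.
    p = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, p), p[: len(X_train)]  # overall AUC, per-train-row scores
```
An AUC near 0.5 indicates little detectable shift; training rows scored as "test-like" can be up-weighted or used to build a more representative validation set.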
arXiv Detail & Related papers (2021-12-19T07:07:15Z) - Robust Fairness under Covariate Shift [11.151913007808927]
Making predictions that are fair with regard to protected group membership has become an important requirement for classification algorithms.
We propose an approach that obtains the predictor that is robust to the worst-case in terms of target performance.
arXiv Detail & Related papers (2020-10-11T04:42:01Z) - Dataset Cartography: Mapping and Diagnosing Datasets with Training
Dynamics [118.75207687144817]
We introduce Data Maps, a model-based tool to characterize and diagnose datasets.
We leverage a largely ignored source of information: the behavior of the model on individual instances during training.
Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
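A small sketch of the data-map coordinates (confidence, variability, and a rough correctness proxy), assuming the true-class probabilities have already been recorded at each training epoch:
```python
# Data-map statistics from per-epoch true-class probabilities.
# `probs_per_epoch` is assumed to have shape (n_epochs, n_examples).
import numpy as np

def data_map_statistics(probs_per_epoch):
    confidence = probs_per_epoch.mean(axis=0)            # mean true-class probability
    variability = probs_per_epoch.std(axis=0)            # std across epochs
    correctness = (probs_per_epoch > 0.5).mean(axis=0)   # rough binary-task proxy
    return confidence, variability, correctness
```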
arXiv Detail & Related papers (2020-09-22T20:19:41Z) - Learning while Respecting Privacy and Robustness to Distributional
Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
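As a hedged illustration of training against worst-case perturbed inputs, the sketch below adversarially trains a logistic regression in plain NumPy with FGSM-style perturbations; this is a common surrogate for robustness to adversarial data, not the paper's distributionally robust or privacy-preserving formulation.
```python
# FGSM-style adversarial training of logistic regression (illustrative surrogate).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adversarial_train(X, y, eps=0.1, lr=0.1, epochs=200):
    """X: (n, d) features; y: (n,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        # For a linear model, the worst-case L-infinity perturbation moves each
        # feature by eps in the loss-increasing direction: sign((p - y) * w).
        signs = np.sign(w) * np.where(y == 1, -1.0, 1.0)[:, None]
        X_adv = X + eps * signs
        p = sigmoid(X_adv @ w + b)
        grad_w = X_adv.T @ (p - y) / len(y)
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```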
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.