The Dataset Multiplicity Problem: How Unreliable Data Impacts Predictions
- URL: http://arxiv.org/abs/2304.10655v1
- Date: Thu, 20 Apr 2023 21:31:15 GMT
- Title: The Dataset Multiplicity Problem: How Unreliable Data Impacts Predictions
- Authors: Anna P. Meyer, Aws Albarghouthi, Loris D'Antoni
- Abstract summary: We introduce dataset multiplicity, a way to study how inaccuracies, uncertainty, and social bias in training datasets impact test-time predictions.
We discuss how to use this framework to encapsulate various sources of uncertainty in datasets' factualness.
Our empirical analysis shows that real-world datasets, under reasonable assumptions, contain many test samples whose predictions are affected by dataset multiplicity.
- Score: 12.00314910031517
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce dataset multiplicity, a way to study how inaccuracies,
uncertainty, and social bias in training datasets impact test-time predictions.
The dataset multiplicity framework asks a counterfactual question of what the
set of resultant models (and associated test-time predictions) would be if we
could somehow access all hypothetical, unbiased versions of the dataset. We
discuss how to use this framework to encapsulate various sources of uncertainty
in datasets' factualness, including systemic social bias, data collection
practices, and noisy labels or features. We show how to exactly analyze the
impacts of dataset multiplicity for a specific model architecture and type of
uncertainty: linear models with label errors. Our empirical analysis shows that
real-world datasets, under reasonable assumptions, contain many test samples
whose predictions are affected by dataset multiplicity. Furthermore, the choice
of domain-specific dataset multiplicity definition determines what samples are
affected, and whether different demographic groups are disparately impacted.
Finally, we discuss implications of dataset multiplicity for machine learning
practice and research, including considerations for when model outcomes should
not be trusted.
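As a rough illustration of the linear-models-with-label-errors setting described in the abstract, the sketch below is not the paper's exact analysis (the paper describes an exact method; this is a brute-force check). It enumerates every version of a toy training set obtained by flipping at most k binary labels, retrains a linear classifier on each, and reports whether a given test prediction stays the same across all of them. The function name prediction_is_robust, the parameter max_label_flips, and the scikit-learn setup are illustrative assumptions, not taken from the paper.

# Minimal sketch: dataset multiplicity for a linear model under label errors.
# Assumption: uncertainty is modeled as "up to k training labels may be wrong",
# and we ask whether a test prediction agrees under every such dataset version.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression


def prediction_is_robust(X_train, y_train, x_test, max_label_flips=1):
    """True if x_test gets the same predicted label under every training set
    obtained by flipping at most `max_label_flips` binary labels."""
    n = len(y_train)
    preds = set()
    for k in range(max_label_flips + 1):
        for idxs in combinations(range(n), k):
            y_alt = y_train.copy()
            idxs = list(idxs)
            y_alt[idxs] = 1 - y_alt[idxs]  # flip the selected binary labels
            if len(np.unique(y_alt)) < 2:
                continue  # skip degenerate datasets with only one class
            model = LogisticRegression().fit(X_train, y_alt)
            preds.add(int(model.predict(x_test.reshape(1, -1))[0]))
            if len(preds) > 1:
                return False  # some plausible dataset changes the prediction
    return True


# Toy usage: 2-D points, binary labels, at most one possibly mislabeled point.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(prediction_is_robust(X, y, np.array([2.0, 2.0]), max_label_flips=1))
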
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- A Language Model-Guided Framework for Mining Time Series with Distributional Shifts [5.082311792764403]
This paper presents an approach that utilizes large language models and data source interfaces to explore and collect time series datasets.
While obtained from external sources, the collected data share critical statistical properties with primary time series datasets.
This suggests that the collected datasets can effectively supplement existing datasets, especially when the data distribution shifts.
arXiv Detail & Related papers (2024-06-07T20:21:07Z)
- UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how combining recent results on equivariant representation learning, instantiated on structured spaces, with a simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation [4.339613097080119]
In low-resource scenarios, artifacts of the data collection can yield data sets that are outliers, potentially making conclusions about model performance coincidental.
We compare three broad classes of models with different parameterizations, taking data from 11 languages across 6 language families.
The results demonstrate that the extent of model generalization depends on the characteristics of the data set, and does not necessarily rely heavily on the data set size.
arXiv Detail & Related papers (2022-01-05T22:19:10Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Hidden Biases in Unreliable News Detection Datasets [60.71991809782698]
We show that selection bias during data collection leads to undesired artifacts in the datasets.
We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap.
We suggest that future dataset creation include a simple model as a difficulty/bias probe, and that future model development use a clean, non-overlapping site and date split.
arXiv Detail & Related papers (2021-04-20T17:16:41Z)
- Multi-Source Causal Inference Using Control Variates [81.57072928775509]
We propose a general algorithm to estimate causal effects from multiple data sources.
We show theoretically that the use of control variates reduces the variance of the average treatment effect (ATE) estimate.
We apply this framework to inference from observational data under an outcome selection bias.
arXiv Detail & Related papers (2021-03-30T21:20:51Z)
- Meta Learning for Causal Direction [29.00522306460408]
We introduce a novel generative model that allows distinguishing cause and effect in the small data setting.
We demonstrate our method on various synthetic as well as real-world data and show that it is able to maintain high accuracy in detecting directions across varying dataset sizes.
arXiv Detail & Related papers (2020-07-06T15:12:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.