Do We Really Even Need Data?
- URL: http://arxiv.org/abs/2401.08702v2
- Date: Fri, 2 Feb 2024 23:14:09 GMT
- Title: Do We Really Even Need Data?
- Authors: Kentaro Hoffman, Stephen Salerno, Awan Afiaz, Jeffrey T. Leek, Tyler H. McCormick
- Abstract summary: Researchers increasingly use predictions from pre-trained algorithms as outcome variables.
Standard tools for inference can misrepresent the association between independent variables and the outcome of interest when the true, unobserved outcome is replaced by a predicted value.
- Score: 2.3749120526936465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As artificial intelligence and machine learning tools become more accessible,
and scientists face new obstacles to data collection (e.g. rising costs,
declining survey response rates), researchers increasingly use predictions from
pre-trained algorithms as outcome variables. Though appealing for financial and
logistical reasons, this practice can lead standard inferential tools to
misrepresent the association between independent variables and the outcome of
interest when the true, unobserved outcome is replaced by a predicted value. In
this paper, we characterize the statistical challenges inherent to this
so-called "inference with predicted data" problem and elucidate three potential
sources of error:
(i) the relationship between predicted outcomes and their true, unobserved
counterparts, (ii) robustness of the machine learning model to resampling or
uncertainty about the training data, and (iii) appropriately propagating not
just bias but also uncertainty from predictions into the ultimate inference
procedure.
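Error source (i) can be illustrated with a small simulation. The sketch below is hypothetical (not from the paper): a noisy, attenuated prediction replaces the true outcome, and the naive OLS slope is biased toward zero.

```python
import random
import statistics

random.seed(0)

n = 5000
beta = 2.0  # true association between x and y

x = [random.gauss(0, 1) for _ in range(n)]
y = [beta * xi + random.gauss(0, 1) for xi in x]  # true, unobserved outcome

# An assumed pre-trained model whose predictions shrink the signal:
# yhat = 0.6 * y + noise (imperfectly correlated with the true outcome).
yhat = [0.6 * yi + random.gauss(0, 1) for yi in y]

def ols_slope(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

slope_true = ols_slope(x, y)     # near beta = 2.0
slope_pred = ols_slope(x, yhat)  # near 0.6 * beta = 1.2: attenuated
```

Regressing on the predicted outcome recovers the slope of the *prediction*, not of the true outcome, which is exactly the misrepresentation the abstract describes.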
Related papers
- Learning Latent Graph Structures and their Uncertainty [63.95971478893842]
Graph Neural Networks (GNNs) use relational information as an inductive bias to enhance the model's accuracy.
As task-relevant relations might be unknown, graph structure learning approaches have been proposed to learn them while solving the downstream prediction task.
arXiv Detail & Related papers (2024-05-30T10:49:22Z)
- Multi-Source Conformal Inference Under Distribution Shift [41.701790856201036]
We consider the problem of obtaining distribution-free prediction intervals for a target population, leveraging multiple potentially biased data sources.
We derive the efficient influence functions for the quantiles of unobserved outcomes in the target and source populations.
We propose a data-adaptive strategy to upweight informative data sources for efficiency gain and downweight non-informative data sources for bias reduction.
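As background for the distribution-free intervals mentioned above, here is a minimal single-source split conformal sketch (a toy under assumed noise, not the paper's multi-source method):

```python
import math
import random

random.seed(1)

def split_conformal_interval(cal_pairs, predict, x_new, alpha=0.1):
    # Nonconformity scores: absolute residuals on a held-out calibration set.
    scores = sorted(abs(y - predict(x)) for x, y in cal_pairs)
    n = len(scores)
    # Finite-sample-adjusted quantile gives >= (1 - alpha) marginal coverage.
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    q = scores[k]
    center = predict(x_new)
    return center - q, center + q

# Toy setup: an assumed black-box predictor and noisy outcomes.
def draw():
    x = random.uniform(-3, 3)
    return x, 2.0 * x + random.gauss(0, 1)

predict = lambda x: 2.0 * x
calibration = [draw() for _ in range(500)]

trials, covered = 2000, 0
for _ in range(trials):
    x, y = draw()
    lo, hi = split_conformal_interval(calibration, predict, x, alpha=0.1)
    covered += lo <= y <= hi
coverage = covered / trials  # empirically near the 90% nominal level
```

The multi-source setting additionally reweights calibration sources under distribution shift, which this single-source sketch does not attempt.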
arXiv Detail & Related papers (2024-05-15T13:33:09Z)
- Fair Generalized Linear Mixed Models [0.0]
Fairness in machine learning aims to ensure that biases in the data and model inaccuracies do not lead to discriminatory decisions.
We present an algorithm that can handle both problems simultaneously.
arXiv Detail & Related papers (2024-05-15T11:42:41Z)
- Cross-Prediction-Powered Inference [15.745692520785074]
Cross-prediction is a method for valid inference powered by machine learning.
We show that cross-prediction is consistently more powerful than an adaptation of prediction-powered inference.
arXiv Detail & Related papers (2023-09-28T17:01:58Z)
- Quantification of Predictive Uncertainty via Inference-Time Sampling [57.749601811982096]
We propose a post-hoc sampling strategy for estimating predictive uncertainty accounting for data ambiguity.
The method can generate different plausible outputs for a given input and does not assume parametric forms of predictive distributions.
arXiv Detail & Related papers (2023-08-03T12:43:21Z)
- Advancing Counterfactual Inference through Nonlinear Quantile Regression [77.28323341329461]
We propose a framework for efficient and effective counterfactual inference implemented with neural networks.
The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data.
Experiments on multiple datasets offer compelling support for our theoretical assertions.
arXiv Detail & Related papers (2023-06-09T08:30:51Z)
- Is augmentation effective to improve prediction in imbalanced text datasets? [3.1690891866882236]
We argue that adjusting the cutoffs without data augmentation can produce similar results to oversampling techniques.
Our findings contribute to a better understanding of the strengths and limitations of different approaches to dealing with imbalanced data.
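The cutoff-adjustment claim can be illustrated with a hypothetical sketch (synthetic scores, not the paper's experiments): lowering the decision threshold recovers minority-class recall without any resampling.

```python
import random

random.seed(2)

# Synthetic imbalanced scores: 5% positives, whose scores run higher on average.
data = []
for _ in range(10000):
    y = 1 if random.random() < 0.05 else 0
    score = random.gauss(0.45 if y else 0.20, 0.15)
    data.append((score, y))

def recall_at_cutoff(data, cutoff):
    positives = [s for s, y in data if y == 1]
    return sum(s >= cutoff for s in positives) / len(positives)

r_default = recall_at_cutoff(data, 0.50)  # default cutoff misses most positives
r_lowered = recall_at_cutoff(data, 0.30)  # lowered cutoff, no augmentation
```

Oversampling the minority class shifts the fitted classifier's operating point in much the same way; adjusting the cutoff reaches a comparable point directly.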
arXiv Detail & Related papers (2023-04-20T13:07:31Z)
- Prediction-Powered Inference [68.97619568620709]
Prediction-powered inference is a framework for performing valid statistical inference when an experimental dataset is supplemented with predictions from a machine-learning system.
The framework yields simple algorithms for computing provably valid confidence intervals for quantities such as means, quantiles, and linear and logistic regression coefficients.
Prediction-powered inference could enable researchers to draw valid and more data-efficient conclusions using machine learning.
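The core idea can be sketched for a simple mean (a toy under assumed noise and bias, not the framework's full machinery): a small labeled set estimates a "rectifier" that debiases the mean of predictions over a large unlabeled set.

```python
import random
import statistics

random.seed(3)

true_mean = 1.0  # estimand: the mean of the unobserved outcome Y

def predict(y):
    # Assumed black-box predictor: systematically overshoots by 0.5.
    return y + 0.5 + random.gauss(0, 0.2)

# Small labeled sample: both the prediction and the true outcome are observed.
labeled = []
for _ in range(200):
    y = random.gauss(true_mean, 1)
    labeled.append((predict(y), y))

# Large unlabeled sample: only predictions are observed.
unlabeled_preds = [predict(random.gauss(true_mean, 1)) for _ in range(20000)]

naive = statistics.fmean(unlabeled_preds)                # biased upward
rectifier = statistics.fmean(p - y for p, y in labeled)  # estimates the bias
ppi_estimate = naive - rectifier                         # debiased mean estimate
```

The naive average of predictions inherits the predictor's bias; subtracting the labeled-set rectifier removes it while still exploiting the large unlabeled sample.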
arXiv Detail & Related papers (2023-01-23T18:59:28Z)
- Double Robust Representation Learning for Counterfactual Prediction [68.78210173955001]
We propose a novel scalable method to learn double-robust representations for counterfactual predictions.
We make robust and efficient counterfactual predictions for both individual and average treatment effects.
The algorithm shows competitive performance with the state-of-the-art on real world and synthetic data.
arXiv Detail & Related papers (2020-10-15T16:39:26Z)
- Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.