Measuring Model Performance in the Presence of an Intervention
- URL: http://arxiv.org/abs/2511.05805v2
- Date: Fri, 14 Nov 2025 20:24:12 GMT
- Title: Measuring Model Performance in the Presence of an Intervention
- Authors: Winston Chen, Michael W. Sjoding, Jenna Wiens
- Abstract summary: In many AI for social impact applications, the presence of an intervention that affects the outcome can bias the evaluation. Randomized controlled trials (RCTs) randomly assign interventions, allowing data from the control group to be used for unbiased model evaluation. We propose nuisance parameter weighting (NPW), an unbiased model evaluation approach that reweights data from the treatment group to mimic the distributions of samples that would or would not experience the outcome under no intervention.
- Score: 11.381587523287495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI models are often evaluated based on their ability to predict the outcome of interest. However, in many AI for social impact applications, the presence of an intervention that affects the outcome can bias the evaluation. Randomized controlled trials (RCTs) randomly assign interventions, allowing data from the control group to be used for unbiased model evaluation. However, this approach is inefficient because it ignores data from the treatment group. Given the complexity and cost often associated with RCTs, making the most use of the data is essential. Thus, we investigate model evaluation strategies that leverage all data from an RCT. First, we theoretically quantify the estimation bias that arises from naïvely aggregating performance estimates from treatment and control groups and derive the condition under which this bias leads to incorrect model selection. Leveraging these theoretical insights, we propose nuisance parameter weighting (NPW), an unbiased model evaluation approach that reweights data from the treatment group to mimic the distributions of samples that would or would not experience the outcome under no intervention. Using synthetic and real-world datasets, we demonstrate that our proposed evaluation approach consistently yields better model selection than the standard approach, which ignores data from the treatment group, across various intervention effect and sample size settings. Our contribution represents a meaningful step towards more efficient model evaluation in real-world contexts.
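To make the reweighting idea concrete, the sketch below shows one way an evaluation that uses all RCT data could look: control samples enter with weight one, while treatment-group samples are reweighted using a nuisance model of the outcome probability under no intervention, fit on the control arm. The abstract does not specify the NPW weights, so the weighting rule, the function name `npw_style_evaluation`, and the logistic-regression nuisance model are illustrative assumptions rather than the paper's method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def npw_style_evaluation(X, y, treated, risk_scores):
    """Illustrative reweighted model evaluation on RCT data.

    X           : feature matrix (n_samples, n_features)
    y           : observed binary outcomes
    treated     : boolean array, True for the treatment group
    risk_scores : predictions of the model being evaluated
    """
    control = ~treated

    # Nuisance model: outcome probability under no intervention,
    # estimated from the control group only.
    nuisance = LogisticRegression(max_iter=1000).fit(X[control], y[control])
    p_untreated = nuisance.predict_proba(X[treated])[:, 1]

    # Placeholder weights (NOT the paper's NPW weights): treated samples
    # stand in for those that would / would not experience the outcome
    # under no intervention, weighted by the nuisance probabilities.
    w_treated = np.where(y[treated] == 1, p_untreated, 1.0 - p_untreated)

    labels = np.concatenate([y[control], y[treated]])
    scores = np.concatenate([risk_scores[control], risk_scores[treated]])
    weights = np.concatenate([np.ones(control.sum()), w_treated])

    # Weighted AUROC over pooled control + reweighted treatment data.
    return roc_auc_score(labels, scores, sample_weight=weights)
```

In practice such a pooled, reweighted estimate would be compared against the control-only estimate that the abstract describes as the standard approach, for example when selecting between candidate models.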
Related papers
- Detecting and Mitigating Group Bias in Heterogeneous Treatment Effects [28.4891545570248]
We develop a statistical framework to detect and mitigate group bias in randomized experiments. For mitigation, we propose a shrinkage-based bias correction and show that the theoretically optimal and empirically feasible solutions have closed-form expressions. We analyze the economic implications of mitigating detected group bias for profit-maximizing personalized targeting.
arXiv Detail & Related papers (2026-02-23T21:47:01Z) - Robust estimation of heterogeneous treatment effects in randomized trials leveraging external data [4.777323087050061]
We propose the QR-learner, a model-agnostic learner that estimates conditional average treatment effects (CATE) within a trial population. It can reduce the mean squared error relative to a trial-only CATE learner, and is guaranteed to recover the true CATE even when the external data are not aligned with the trial.
arXiv Detail & Related papers (2025-07-04T16:01:05Z) - Understanding challenges to the interpretation of disaggregated evaluations of algorithmic fairness [49.35494016290887]
We show that equal performance across subgroups is an unreliable measure of fairness when data are representative of relevant populations but reflective of real-world disparities. Our framework suggests complementing disaggregated evaluations with explicit causal assumptions and analysis to control for confounding and distribution shift.
arXiv Detail & Related papers (2025-06-04T17:40:31Z) - Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis [36.689210473887904]
We introduce a benchmarking framework for evaluating cross-dataset prediction generalization in deep learning (DL) and machine learning (ML) models. We quantify both absolute performance (e.g., predictive accuracy across datasets) and relative performance (e.g., the performance drop compared to within-dataset results). Our results reveal substantial performance drops when models are tested on unseen datasets, underscoring the importance of rigorous generalization assessments.
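As a minimal illustration of the relative-performance notion above, the helper below computes the drop from within-dataset to cross-dataset performance; the function name and the example numbers are made up for this sketch.

```python
def relative_performance_drop(within_score, cross_score):
    """Fraction of within-dataset performance lost on an unseen dataset.

    Both scores are assumed to be higher-is-better metrics
    (e.g., Pearson correlation of predicted vs. observed drug response).
    """
    return (within_score - cross_score) / within_score

# Example: 0.80 within-dataset vs. 0.60 cross-dataset -> 25% relative drop.
drop = relative_performance_drop(0.80, 0.60)  # 0.25
```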
arXiv Detail & Related papers (2025-03-18T15:40:18Z) - Rethinking Relation Extraction: Beyond Shortcuts to Generalization with a Debiased Benchmark [53.876493664396506]
Benchmarks are crucial for evaluating machine learning algorithm performance, facilitating comparison and identifying superior solutions. This paper addresses the issue of entity bias in relation extraction tasks, where models tend to rely on entity mentions rather than context. We propose DREB, a debiased relation extraction benchmark that breaks the pseudo-correlation between entity mentions and relation types through entity replacement. To establish a new baseline on DREB, we introduce MixDebias, a debiasing method combining data-level and model training-level techniques.
arXiv Detail & Related papers (2025-01-02T17:01:06Z) - Estimating Conditional Average Treatment Effects via Sufficient Representation Learning [31.822980052107496]
This paper proposes a novel neural network approach named CrossNet to learn a sufficient representation for the features, based on which we then estimate the conditional average treatment effects (CATE). Numerical simulations and empirical results demonstrate that our method outperforms the competitive approaches.
arXiv Detail & Related papers (2024-08-30T07:23:59Z) - Efficient adjustment for complex covariates: Gaining efficiency with DOPE [56.537164957672715]
We propose a framework that accommodates adjustment for any subset of information expressed by the covariates.
Based on our theoretical results, we propose the Debiased Outcome-adapted Propensity Estimator (DOPE) for efficient estimation of the average treatment effect (ATE).
Our results show that the DOPE provides an efficient and robust methodology for ATE estimation in various observational settings.
arXiv Detail & Related papers (2024-02-20T13:02:51Z) - Estimating treatment effects from single-arm trials via latent-variable modeling [14.083487062917085]
Single-arm trials, where all patients belong to the treatment group, can be a viable alternative but require access to an external control group.
We propose an identifiable deep latent-variable model for this scenario.
Our results show improved performance both for direct treatment effect estimation as well as for effect estimation via patient matching.
arXiv Detail & Related papers (2023-11-06T10:12:54Z) - GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z) - Empirical Analysis of Model Selection for Heterogeneous Causal Effect Estimation [24.65301562548798]
We study the problem of model selection in causal inference, specifically for conditional average treatment effect (CATE) estimation.
We conduct an empirical analysis to benchmark the surrogate model selection metrics introduced in the literature, as well as the novel ones introduced in this work.
arXiv Detail & Related papers (2022-11-03T16:26:06Z) - Double machine learning for sample selection models [0.12891210250935145]
This paper considers the evaluation of discretely distributed treatments when outcomes are only observed for a subpopulation due to sample selection or outcome attrition.
We make use of (a) Neyman-orthogonal, doubly robust, and efficient score functions, which imply the robustness of treatment effect estimation to moderate regularization biases in the machine learning-based estimation of the outcome, treatment, or sample selection models and (b) sample splitting (or cross-fitting) to prevent overfitting bias.
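For readers unfamiliar with these ingredients, the sketch below shows cross-fitted estimation of the average treatment effect with the standard doubly robust (AIPW) score in the plain setting without sample selection; it is a generic illustration of Neyman-orthogonal scores plus sample splitting, not the paper's estimator, and the random-forest nuisance models are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

def cross_fitted_aipw_ate(X, d, y, n_splits=2, seed=0):
    """Cross-fitted AIPW estimate of the ATE.

    X : covariate matrix, d : 0/1 treatment indicator, y : outcomes.
    Nuisance models are fit on one fold and evaluated on the held-out
    fold (sample splitting / cross-fitting) to avoid overfitting bias.
    """
    psi = np.zeros(len(y))
    for train, test in KFold(n_splits=n_splits, shuffle=True,
                             random_state=seed).split(X):
        X_tr, X_te, d_tr, d_te = X[train], X[test], d[train], d[test]
        y_tr, y_te = y[train], y[test]

        # Outcome regressions mu_1(x) and mu_0(x), fit separately per arm.
        mu1 = RandomForestRegressor(random_state=seed).fit(X_tr[d_tr == 1], y_tr[d_tr == 1])
        mu0 = RandomForestRegressor(random_state=seed).fit(X_tr[d_tr == 0], y_tr[d_tr == 0])
        # Propensity score e(x) = P(D = 1 | X = x), clipped away from 0 and 1.
        e = np.clip(RandomForestClassifier(random_state=seed)
                    .fit(X_tr, d_tr).predict_proba(X_te)[:, 1], 0.01, 0.99)

        m1, m0 = mu1.predict(X_te), mu0.predict(X_te)
        # Neyman-orthogonal (doubly robust) score on the held-out fold.
        psi[test] = (m1 - m0
                     + d_te * (y_te - m1) / e
                     - (1 - d_te) * (y_te - m0) / (1 - e))
    return psi.mean()
```

Clipping the estimated propensities away from 0 and 1 is a common practical safeguard against extreme inverse weights.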
arXiv Detail & Related papers (2020-11-30T19:40:21Z) - Performance metrics for intervention-triggering prediction models do not reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z) - Learning Overlapping Representations for the Estimation of Individualized Treatment Effects [97.42686600929211]
Estimating the likely outcome of alternatives from observational data is a challenging problem.
We show that algorithms that learn domain-invariant representations of inputs are often inappropriate.
We develop a deep kernel regression algorithm and posterior regularization framework that substantially outperforms the state-of-the-art on a variety of benchmark data sets.
arXiv Detail & Related papers (2020-01-14T12:56:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.