A Consequentialist Critique of Binary Classification Evaluation Practices
- URL: http://arxiv.org/abs/2504.04528v1
- Date: Sun, 06 Apr 2025 15:58:01 GMT
- Title: A Consequentialist Critique of Binary Classification Evaluation Practices
- Authors: Gerardo Flores, Abigail Schiff, Alyssa H. Smith, Julia A. Fukuyama, Ashia C. Wilson
- Abstract summary: We find a strong preference for top-K metrics or fixed thresholds in evaluations at major conferences like ICML, FAccT, and CHIL. We use this decision-theoretic framework to map evaluation metrics to their optimal use cases, along with a Python package, briertools, to promote the broader adoption of Brier scores.
- Score: 4.603739046972463
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: ML-supported decisions, such as ordering tests or determining preventive custody, often involve binary classification based on probabilistic forecasts. Evaluation frameworks for such forecasts typically consider whether to prioritize independent-decision metrics (e.g., Accuracy) or top-K metrics (e.g., Precision@K), and whether to focus on fixed thresholds or threshold-agnostic measures like AUC-ROC. We highlight that a consequentialist perspective, long advocated by decision theorists, should naturally favor evaluations that support independent decisions using a mixture of thresholds given their prevalence, such as Brier scores and Log loss. However, our empirical analysis reveals a strong preference for top-K metrics or fixed thresholds in evaluations at major conferences like ICML, FAccT, and CHIL. To address this gap, we use this decision-theoretic framework to map evaluation metrics to their optimal use cases, along with a Python package, briertools, to promote the broader adoption of Brier scores. In doing so, we also uncover new theoretical connections, including a reconciliation between the Brier Score and Decision Curve Analysis, which clarifies and responds to a longstanding critique by Assel et al. (2017) regarding the clinical utility of proper scoring rules.
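The abstract's claim that Brier scores support "independent decisions using a mixture of thresholds" matches a standard Schervish-style representation of the Brier score; the identity below is a well-known result consistent with that framing, not an equation quoted from the paper:

\[
  (y - p)^2 \;=\; 2\int_0^1 \Big[\, c\,\mathbf{1}\{y = 0,\ p > c\} \;+\; (1 - c)\,\mathbf{1}\{y = 1,\ p \le c\} \Big]\, dc ,
\]

i.e., the squared error of a forecast p for outcome y is twice the average, over all thresholds c in (0, 1), of the cost-weighted misclassification loss incurred when acting at threshold c.

A minimal Python sketch contrasting the metric families named in the abstract (a fixed-threshold metric, a top-K metric, and the threshold-agnostic Brier score and log loss) follows. The briertools package is not used here because its interface is not described in the abstract; standard scikit-learn metrics and synthetic data stand in for illustration:

```python
# Sketch: fixed-threshold and top-K metrics vs. threshold-agnostic proper
# scoring rules, on synthetic probabilistic forecasts (illustrative only).
import numpy as np
from sklearn.metrics import accuracy_score, brier_score_loss, log_loss

rng = np.random.default_rng(0)

n = 10_000
p = rng.beta(2, 5, size=n)    # predicted probabilities
y = rng.binomial(1, p)        # outcomes drawn consistently with the forecasts

# Fixed-threshold metric: accuracy at a single operating point (0.5).
acc_at_half = accuracy_score(y, (p >= 0.5).astype(int))

# Top-K metric: precision among the K highest-risk cases.
K = 100
top_k = np.argsort(p)[::-1][:K]
precision_at_k = y[top_k].mean()

# Threshold-agnostic proper scoring rules: reward calibrated probabilities
# and correspond to averaging decision losses over a mixture of thresholds.
brier = brier_score_loss(y, p)
nll = log_loss(y, p)

print(f"Accuracy@0.5 : {acc_at_half:.3f}")
print(f"Precision@{K}: {precision_at_k:.3f}")
print(f"Brier score  : {brier:.3f}")
print(f"Log loss     : {nll:.3f}")
```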
Related papers
- Top-K Pairwise Ranking: Bridging the Gap Among Ranking-Based Measures for Multi-Label Classification [120.37051160567277]
This paper proposes a novel measure named Top-K Pairwise Ranking (TKPR).
A series of analyses show that TKPR is compatible with existing ranking-based measures.
On the other hand, we establish a sharp generalization bound for the proposed framework based on a novel technique named data-dependent contraction.
arXiv Detail & Related papers (2024-07-09T09:36:37Z) - Query Performance Prediction using Relevance Judgments Generated by Large Language Models [53.97064615557883]
We propose a QPP framework using automatically generated relevance judgments (QPP-GenRE).
QPP-GenRE decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query.
This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels.
arXiv Detail & Related papers (2024-04-01T09:33:05Z) - Likelihood Ratio Confidence Sets for Sequential Decision Making [51.66638486226482]
We revisit the likelihood-based inference principle and propose to use likelihood ratios to construct valid confidence sequences.
Our method is especially suitable for problems with well-specified likelihoods.
We show how to provably choose the best sequence of estimators and shed light on connections to online convex optimization.
arXiv Detail & Related papers (2023-11-08T00:10:21Z) - Identification and multiply robust estimation in causal mediation analysis across principal strata [7.801213477601286]
We consider assessing causal mediation in the presence of a post-treatment event.
We derive the efficient influence function for each mediation estimand, which motivates a set of multiply robust estimators for inference.
arXiv Detail & Related papers (2023-04-20T00:39:20Z) - Orthogonal Series Estimation for the Ratio of Conditional Expectation
Functions [2.855485723554975]
This chapter develops a general framework for estimation and inference on the ratio of conditional expectation functions (CEFR).
We derive the pointwise and uniform results for estimation and inference on CEFR, including the validity of the Gaussian bootstrap.
We apply the proposed method to estimate the causal effect of the 401(k) program on household assets.
arXiv Detail & Related papers (2022-12-26T13:01:17Z) - Robust Design and Evaluation of Predictive Algorithms under Unobserved Confounding [2.8498944632323755]
We propose a unified framework for the robust design and evaluation of predictive algorithms in selectively observed data.
We impose general assumptions on how much the outcome may vary on average between unselected and selected units.
We develop debiased machine learning estimators for the bounds on a large class of predictive performance estimands.
arXiv Detail & Related papers (2022-12-19T20:41:44Z) - Evaluating Treatment Prioritization Rules via Rank-Weighted Average Treatment Effects [24.258855352542096]
We propose rank-weighted average treatment effect (RATE) metrics as a simple and general family of metrics for comparing and testing the quality of treatment prioritization rules.
RATE metrics are agnostic to how the prioritization rules were derived, and only assess how well they identify individuals who benefit the most from treatment.
We showcase RATE in the context of a number of applications, including optimal targeting of aspirin to stroke patients.
arXiv Detail & Related papers (2021-11-15T18:22:35Z) - Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes [65.91730154730905]
In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors.
Here we tackle this by considering off-policy evaluation in a partially observed Markov decision process (POMDP).
We extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible.
arXiv Detail & Related papers (2021-10-28T17:46:14Z) - Deconfounding Scores: Feature Representations for Causal Effect Estimation with Weak Overlap [140.98628848491146]
We introduce deconfounding scores, which induce better overlap without biasing the target of estimation.
We show that deconfounding scores satisfy a zero-covariance condition that is identifiable in observed data.
In particular, we show that this technique could be an attractive alternative to standard regularizations.
arXiv Detail & Related papers (2021-04-12T18:50:11Z) - Deep ROC Analysis and AUC as Balanced Average Accuracy to Improve Model Selection, Understanding and Interpretation [4.7096631717710045]
Optimal performance is critical for decision-making tasks from medicine to autonomous driving.
Measures such as accuracy, sensitivity, or the F1 score are computed at a single threshold and reflect a single probability or predicted risk.
We propose a method in between, deep ROC analysis, that examines groups of probabilities or predicted risks for more insightful analysis (a rough sketch of this group-wise idea appears after this list).
arXiv Detail & Related papers (2021-03-21T10:27:35Z) - Leveraging Expert Consistency to Improve Algorithmic Decision Support [62.61153549123407]
We explore the use of historical expert decisions as a rich source of information that can be combined with observed outcomes to narrow the construct gap.
We propose an influence function-based methodology to estimate expert consistency indirectly when each case in the data is assessed by a single expert.
Our empirical evaluation, using simulations in a clinical setting and real-world data from the child welfare domain, indicates that the proposed approach successfully narrows the construct gap.
arXiv Detail & Related papers (2021-01-24T05:40:29Z) - Invariant Rationalization [84.1861516092232]
A typical rationalization criterion, i.e. maximum mutual information (MMI), finds the rationale that maximizes the prediction performance based only on the rationale.
We introduce a game-theoretic invariant rationalization criterion where the rationales are constrained to enable the same predictor to be optimal across different environments.
We show both theoretically and empirically that the proposed rationales can rule out spurious correlations, generalize better to different test scenarios, and align better with human judgments.
arXiv Detail & Related papers (2020-03-22T00:50:27Z)
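As referenced in the deep ROC analysis entry above, the idea of evaluating groups of the ROC curve rather than a single threshold or a single whole-curve AUC can be illustrated roughly as follows. This is an interpretation for illustration under assumptions (synthetic data, FPR-based groups, normalized partial AUC), not the exact procedure from that paper:

```python
# Rough sketch: normalized partial AUC within groups of false-positive rate,
# instead of one overall AUC or one fixed threshold (illustrative only).
import numpy as np
from sklearn.metrics import auc, roc_curve

rng = np.random.default_rng(1)
p = rng.beta(2, 5, size=5_000)   # predicted risks
y = rng.binomial(1, p)           # outcomes drawn consistently with the risks

fpr, tpr, _ = roc_curve(y, p)

# Evaluate each FPR group (here: thirds of [0, 1]) separately.
for lo, hi in [(0.0, 1 / 3), (1 / 3, 2 / 3), (2 / 3, 1.0)]:
    grid = np.linspace(lo, hi, 200)
    tpr_grid = np.interp(grid, fpr, tpr)       # TPR interpolated on the group
    partial = auc(grid, tpr_grid) / (hi - lo)  # normalize by group width
    print(f"FPR in [{lo:.2f}, {hi:.2f}]: normalized partial AUC = {partial:.3f}")
```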