Aligning the Evaluation of Probabilistic Predictions with Downstream Value
- URL: http://arxiv.org/abs/2508.18251v1
- Date: Mon, 25 Aug 2025 17:41:27 GMT
- Title: Aligning the Evaluation of Probabilistic Predictions with Downstream Value
- Authors: Novin Shahroudi, Viacheslav Komisarenko, Meelis Kull
- Abstract summary: Metrics based solely on predictive performance often diverge from measures of real-world downstream impact. We propose a data-driven method to learn a proxy evaluation function aligned with the downstream evaluation. Our approach leverages weighted scoring rules parametrized by a neural network, where weighting is learned to align with the performance in the downstream task.
- Score: 2.6636053598505307
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Every prediction is ultimately used in a downstream task. Consequently, evaluating prediction quality is more meaningful when considered in the context of its downstream use. Metrics based solely on predictive performance often diverge from measures of real-world downstream impact. Existing approaches incorporate the downstream view by relying on multiple task-specific metrics, which can be burdensome to analyze, or by formulating cost-sensitive evaluations that require an explicit cost structure, typically assumed to be known a priori. We frame this mismatch as an evaluation alignment problem and propose a data-driven method to learn a proxy evaluation function aligned with the downstream evaluation. Building on the theory of proper scoring rules, we explore transformations of scoring rules that ensure the preservation of propriety. Our approach leverages weighted scoring rules parametrized by a neural network, where weighting is learned to align with the performance in the downstream task. This enables fast and scalable evaluation cycles across tasks where the weighting is complex or unknown a priori. We showcase our framework through synthetic and real-data experiments for regression tasks, demonstrating its potential to bridge the gap between predictive evaluation and downstream utility in modular prediction systems.
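The weighted-scoring-rule idea from the abstract can be illustrated with a threshold-weighted CRPS: integrating the Brier score of the predictive CDF over thresholds, with a nonnegative weight over thresholds, preserves propriety. The paper learns the weighting with a neural network; the sketch below is only a minimal stand-in that uses a fixed sigmoid weight emphasizing the upper tail (all constants, names, and the Gaussian-forecast setup are illustrative assumptions, not the authors' implementation).

```python
import numpy as np
from math import erf

def tw_crps(mu, sigma, y, w, grid):
    """Threshold-weighted CRPS of a Gaussian forecast N(mu, sigma^2)
    against observation y, approximated on a threshold grid z:

        twCRPS(F, y) = integral of w(z) * (F(z) - 1{y <= z})^2 dz

    Any weight w(z) >= 0 keeps the score proper, which is the kind of
    propriety-preserving transformation the abstract refers to."""
    cdf = 0.5 * (1.0 + np.array([erf((z - mu) / (sigma * np.sqrt(2.0)))
                                 for z in grid]))
    ind = (grid >= y).astype(float)   # indicator 1{y <= z}
    dz = grid[1] - grid[0]
    return float(np.sum(w * (cdf - ind) ** 2) * dz)

# Threshold grid and two candidate weightings. The paper learns w with a
# neural network; a fixed sigmoid emphasizing large outcomes stands in here,
# mimicking a downstream task whose cost concentrates on the upper tail.
grid = np.linspace(-8.0, 8.0, 1601)
w_uniform = np.ones_like(grid)                 # recovers the plain CRPS
w_tail = 1.0 / (1.0 + np.exp(-(grid - 1.0)))   # upweights thresholds above ~1

# Scoring a sharp forecaster vs. a wide one on a tail observation y = 2:
# the two weightings generally rank the forecasters differently, which is
# exactly the evaluation-alignment gap the method targets.
sharp = tw_crps(0.0, 0.5, 2.0, w_tail, grid)
wide = tw_crps(0.0, 2.0, 2.0, w_tail, grid)
```

In the paper's framing, the weight function would be the output of a neural network fitted so that rankings under the weighted score track rankings under the downstream evaluation; the sigmoid here merely shows how a weighting reshapes the score.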
Related papers
- Geometric Data Valuation via Leverage Scores [0.2538209532048866]
We propose a geometric alternative to Shapley data valuation based on statistical leverage scores. We show that our scores satisfy the dummy, efficiency, and symmetry axioms of Shapley valuation. We also show that training on a leverage-sampled subset produces a model whose parameters and predictive risk are within $O(\varepsilon)$ of the full-data optimum.
arXiv Detail & Related papers (2025-11-03T22:20:50Z) - Adversary-Free Counterfactual Prediction via Information-Regularized Representations [8.760019957506719]
We study counterfactual prediction under decoder bias and propose a mathematically grounded, information-theoretic approach. We derive a tractable variational objective that upper-bounds the information term and couples it with a supervised assignment, yielding a stable, provably motivated training criterion. We evaluate the method on controlled numerical simulations and a real-world clinical dataset, comparing against recent state-of-the-art balancing, reweighting, and adversarial baselines.
arXiv Detail & Related papers (2025-10-17T09:49:04Z) - Multiply Robust Conformal Risk Control with Coarsened Data [0.0]
Conformal Prediction (CP) has recently received a tremendous amount of interest. In this paper, we consider the general problem of obtaining distribution-free valid prediction regions for an outcome given coarsened data. Our principled use of semiparametric theory has the key advantage of facilitating flexible machine learning methods.
arXiv Detail & Related papers (2025-08-21T12:14:44Z) - A Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attributions [60.06461883533697]
We first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill. We then introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria. Our analysis also offers insights into defending against neural Trojans by utilizing the attributions.
arXiv Detail & Related papers (2024-05-02T13:48:37Z) - Has the Deep Neural Network learned the Stochastic Process? An Evaluation Viewpoint [17.897121328003617]
This paper presents the first systematic study of evaluating Deep Neural Networks (DNNs). We show that traditional evaluation methods assess a DNN's ability to replicate the observed ground truth but fail to measure the underlying process. We propose a new evaluation criterion called Fidelity to Stochastic Process (F2SP).
arXiv Detail & Related papers (2024-02-23T07:54:20Z) - Towards Motion Forecasting with Real-World Perception Inputs: Are End-to-End Approaches Competitive? [93.10694819127608]
We propose a unified evaluation pipeline for forecasting methods with real-world perception inputs.
Our in-depth study uncovers a substantial performance gap when transitioning from curated to perception-based data.
arXiv Detail & Related papers (2023-06-15T17:03:14Z) - Improving Adaptive Conformal Prediction Using Self-Supervised Learning [72.2614468437919]
We train an auxiliary model with a self-supervised pretext task on top of an existing predictive model and use the self-supervised error as an additional feature to estimate nonconformity scores.
We empirically demonstrate the benefit of the additional information using both synthetic and real data on the efficiency (width), deficit, and excess of conformal prediction intervals.
arXiv Detail & Related papers (2023-02-23T18:57:14Z) - Post Reinforcement Learning Inference [22.117487428829488]
We consider estimation and inference using data collected from reinforcement learning algorithms. We propose a weighted Z-estimation approach with carefully designed adaptive weights to stabilize the time-varying variance. Primary applications include dynamic treatment effect estimation and dynamic off-policy evaluation.
arXiv Detail & Related papers (2023-02-17T12:53:15Z) - Local Evaluation of Time Series Anomaly Detection Algorithms [9.717823994163277]
We show that an adversary algorithm can reach high precision and recall on almost any dataset under weak assumptions.
We propose a theoretically grounded, robust, parameter-free and interpretable extension to precision/recall metrics.
arXiv Detail & Related papers (2022-06-27T10:18:41Z) - Evaluating Predictive Distributions: Does Bayesian Deep Learning Work? [45.290773422944866]
Posterior predictive distributions quantify uncertainties ignored by point estimates.
This paper introduces The Neural Testbed, which provides tools for the systematic evaluation of agents that generate such predictions.
arXiv Detail & Related papers (2021-10-09T18:54:02Z) - Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation [53.83642844626703]
We provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation.
Our framework interprets a number of prior approaches as special cases and elucidates the bias and variance trade-off of Hessian estimates.
arXiv Detail & Related papers (2021-06-24T15:58:01Z) - Combining Task Predictors via Enhancing Joint Predictability [53.46348489300652]
We present a new predictor combination algorithm that improves the target by i) measuring the relevance of references based on their capabilities in predicting the target, and ii) strengthening such estimated relevance.
Our algorithm jointly assesses the relevance of all references by adopting a Bayesian framework.
Based on experiments on seven real-world datasets from visual attribute ranking and multi-class classification scenarios, we demonstrate that our algorithm offers a significant performance gain and broadens the application range of existing predictor combination approaches.
arXiv Detail & Related papers (2020-07-15T21:58:39Z) - Performance metrics for intervention-triggering prediction models do not reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.