ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction
- URL: http://arxiv.org/abs/2510.27263v1
- Date: Fri, 31 Oct 2025 08:03:35 GMT
- Title: ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction
- Authors: Han Yu, Kehan Li, Dongbai Li, Yue He, Xingxuan Zhang, Peng Cui
- Abstract summary: Out-of-Distribution (OOD) performance prediction aims to predict the performance of trained models on unlabeled test datasets. We propose the Out-of-Distribution Performance Prediction Benchmark (ODP-Bench), a comprehensive benchmark that includes the most commonly used OOD datasets and existing practical performance prediction algorithms. We provide our trained models as a testbench for future researchers, guaranteeing consistent comparison and avoiding the burden of repeating the model training process.
- Score: 29.953921358142477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, growing attention has been paid to Out-of-Distribution (OOD) performance prediction, whose goal is to predict the performance of trained models on unlabeled OOD test datasets, so that off-the-shelf trained models can be better leveraged and deployed in risk-sensitive scenarios. Although progress has been made in this area, evaluation protocols in previous literature are inconsistent, and most works cover only a limited number of real-world OOD datasets and types of distribution shifts. To provide convenient and fair comparisons among algorithms, we propose the Out-of-Distribution Performance Prediction Benchmark (ODP-Bench), a comprehensive benchmark that includes the most commonly used OOD datasets and existing practical performance prediction algorithms. We provide our trained models as a testbench for future researchers, thus guaranteeing consistent comparison and avoiding the burden of repeating the model training process. Furthermore, we conduct in-depth experimental analyses to better understand the capability boundaries of these algorithms.
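The abstract does not spell out ODP-Bench's interface, so the following is only a minimal sketch of the task being benchmarked, using numpy and an average-confidence baseline (a classic predictor in this literature; whether ODP-Bench includes it specifically is not stated here): given a trained model's softmax outputs on an unlabeled OOD test set, a predictor must output an accuracy estimate without seeing the labels.

    # Minimal sketch of OOD performance prediction (illustrative, not ODP-Bench code).
    import numpy as np

    def average_confidence(probs: np.ndarray) -> float:
        """Estimate accuracy as the mean maximum softmax probability.

        probs: shape (n_examples, n_classes), rows summing to 1.
        A classic baseline; it tends to over-estimate accuracy under shift.
        """
        return float(probs.max(axis=1).mean())

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Stand-in for a trained model's softmax outputs on unlabeled OOD data.
        logits = rng.normal(size=(1000, 10))
        probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        print(f"predicted OOD accuracy: {average_confidence(probs):.3f}")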
Related papers
- RewardBench 2: Advancing Reward Model Evaluation [71.65938693914153]
Reward models are used throughout the post-training of language models to capture nuanced signals from preference data. The community has begun establishing best practices for evaluating reward models. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark.
arXiv Detail & Related papers (2025-06-02T17:54:04Z)
- Forecasting with Deep Learning: Beyond Average of Average of Average Performance [0.393259574660092]
Current practices for evaluating and comparing forecasting models focus on summarising performance into a single score.
We propose a novel framework for evaluating models from multiple perspectives.
We show the advantages of this framework by comparing a state-of-the-art deep learning approach with classical forecasting techniques.
arXiv Detail & Related papers (2024-06-24T12:28:22Z)
- Do-GOOD: Towards Distribution Shift Evaluation for Pre-Trained Visual Document Understanding Models [68.12229916000584]
We develop an out-of-distribution (OOD) benchmark termed Do-GOOD for fine-grained analysis of document image-related tasks.
We then evaluate the robustness of 5 recent VDU pre-trained models and 2 typical OOD generalization algorithms and perform a fine-grained analysis.
arXiv Detail & Related papers (2023-06-05T06:50:42Z)
- Prediction-Oriented Bayesian Active Learning [51.426960808684655]
Expected predictive information gain (EPIG) is an acquisition function that measures information gain in the space of predictions rather than parameters.
EPIG leads to stronger predictive performance compared with BALD across a range of datasets and models.
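For orientation, and as my own paraphrase rather than anything stated in this summary, EPIG is usually defined as EPIG(x) = E_{x_* ~ p_*}[ I(y; y_* | x, x_*) ]: the expected mutual information between the prediction y at a candidate input x and the prediction y_* at an input x_* drawn from the target input distribution p_*.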
arXiv Detail & Related papers (2023-04-17T10:59:57Z)
- ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
- Effective Robustness against Natural Distribution Shifts for Models with Different Training Data [113.21868839569]
"Effective robustness" measures the extra out-of-distribution robustness beyond what can be predicted from the in-distribution (ID) performance.
We propose a new metric to evaluate and compare the effective robustness of models trained on different data.
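As a rough gloss (my paraphrase of the effective-robustness literature, not of this summary): effective robustness is typically measured as rho(f) = acc_OOD(f) - beta(acc_ID(f)), where beta is a baseline fit (often linear in probit-transformed accuracy) over a set of reference models, so rho > 0 means a model is more robust than its in-distribution accuracy alone would predict.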
arXiv Detail & Related papers (2023-02-02T19:28:41Z)
- Towards Realistic Out-of-Distribution Detection: A Novel Evaluation Framework for Improving Generalization in OOD Detection [14.541761912174799]
This paper presents a novel evaluation framework for Out-of-Distribution (OOD) detection.
It aims to assess the performance of machine learning models in more realistic settings.
arXiv Detail & Related papers (2022-11-20T07:30:15Z)
- How Useful are Gradients for OOD Detection Really? [5.459639971144757]
Out-of-distribution (OOD) detection is a critical challenge in deploying highly performant machine learning models in real-life applications.
We provide an in-depth analysis and comparison of gradient-based methods for OOD detection.
We propose a general, non-gradient-based method for OOD detection that improves over previous baselines in both performance and computational efficiency.
arXiv Detail & Related papers (2022-05-20T21:10:05Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
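As a reading aid, here is a minimal sketch of this thresholding idea as I understand it (illustrative names and numpy only; not the authors' implementation):

    # Illustrative ATC-style accuracy estimation; not the authors' code.
    import numpy as np

    def atc_predict_accuracy(val_probs, val_labels, test_probs):
        """Fit a confidence threshold on labeled validation data, then predict
        target accuracy as the fraction of unlabeled test examples whose
        confidence exceeds that threshold."""
        val_conf = val_probs.max(axis=1)
        val_acc = (val_probs.argmax(axis=1) == val_labels).mean()
        # Pick the threshold so that the fraction of validation confidences
        # above it matches the validation accuracy.
        threshold = np.quantile(val_conf, 1.0 - val_acc)
        return float((test_probs.max(axis=1) > threshold).mean())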
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- BEDS-Bench: Behavior of EHR-models under Distributional Shift--A Benchmark [21.040754460129854]
We release BEDS-Bench, a benchmark for quantifying the behavior of ML models over EHR data under OOD settings.
We evaluate several learning algorithms under BEDS-Bench and find that all of them generally show poor generalization performance under distributional shift.
arXiv Detail & Related papers (2021-07-17T05:53:24Z)
- Towards More Fine-grained and Reliable NLP Performance Prediction [85.78131503006193]
We make two contributions to improving performance prediction for NLP tasks.
First, we examine performance predictors for holistic measures of accuracy like F1 or BLEU.
Second, we propose methods to understand the reliability of a performance prediction model from two angles: confidence intervals and calibration.
arXiv Detail & Related papers (2021-02-10T15:23:20Z)
- Learning Prediction Intervals for Model Performance [1.433758865948252]
We propose a method to compute prediction intervals for model performance.
We evaluate our approach across a wide range of drift conditions and show substantial improvement over competitive baselines.
arXiv Detail & Related papers (2020-12-15T21:32:03Z)