Confidence and Dispersity as Signals: Unsupervised Model Evaluation and Ranking
- URL: http://arxiv.org/abs/2510.02956v1
- Date: Fri, 03 Oct 2025 12:48:11 GMT
- Title: Confidence and Dispersity as Signals: Unsupervised Model Evaluation and Ranking
- Authors: Weijian Deng, Weijie Tu, Ibrahim Radwan, Mohammad Abu Alsheikh, Stephen Gould, Liang Zheng
- Abstract summary: This paper presents a unified and practical framework for unsupervised model evaluation and ranking. We show that hybrid metrics consistently outperform single-aspect metrics in both dataset-centric and model-centric evaluation settings.
- Score: 46.95596181965493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Assessing model generalization under distribution shift is essential for real-world deployment, particularly when labeled test data is unavailable. This paper presents a unified and practical framework for unsupervised model evaluation and ranking in two common deployment settings: (1) estimating the accuracy of a fixed model on multiple unlabeled test sets (dataset-centric evaluation), and (2) ranking a set of candidate models on a single unlabeled test set (model-centric evaluation). We demonstrate that two intrinsic properties of model predictions, namely confidence (which reflects prediction certainty) and dispersity (which captures the diversity of predicted classes), together provide strong and complementary signals for generalization. We systematically benchmark a set of confidence-based, dispersity-based, and hybrid metrics across a wide range of model architectures, datasets, and distribution shift types. Our results show that hybrid metrics consistently outperform single-aspect metrics in both dataset-centric and model-centric evaluation settings. In particular, the nuclear norm of the prediction matrix provides robust and accurate performance estimates across tasks, including real-world datasets, and maintains reliability under moderate class imbalance. These findings offer a practical and generalizable basis for unsupervised model assessment in deployment scenarios.
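To make the signals in the abstract concrete, below is a minimal sketch (not the authors' released code) of how confidence, dispersity, and the nuclear-norm hybrid could be computed from a model's softmax outputs on an unlabeled test set. The function name `prediction_signals` and the particular estimators (mean maximum softmax probability for confidence, entropy of the marginal class distribution for dispersity) are illustrative assumptions.

```python
import numpy as np
from scipy.special import softmax

def prediction_signals(logits: np.ndarray) -> dict:
    """Label-free signals computed from an (n_samples, n_classes) logit matrix."""
    probs = softmax(logits, axis=1)  # softmax prediction matrix; rows sum to 1

    # Confidence: how peaked each prediction is; here, the mean of the
    # per-sample maximum softmax probability.
    confidence = probs.max(axis=1).mean()

    # Dispersity: how evenly predictions cover the classes; here, the entropy
    # of the marginal (column-averaged) class distribution.
    marginal = probs.mean(axis=0)
    dispersity = -np.sum(marginal * np.log(marginal + 1e-12))

    # Hybrid: the nuclear norm (sum of singular values) of the prediction
    # matrix, which is large when rows are confident AND predicted classes
    # are diverse, capturing both aspects in a single number.
    nuclear_norm = np.linalg.norm(probs, ord="nuc")

    return {"confidence": confidence,
            "dispersity": dispersity,
            "nuclear_norm": nuclear_norm}
```

Under this sketch's assumptions, model-centric evaluation would rank candidate models by such a score on a shared unlabeled test set, while dataset-centric evaluation would map a fixed model's score on each new unlabeled test set to an accuracy estimate, e.g., via a regression fitted on shifted datasets with known accuracy, as is common in this line of work.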
Related papers
- Ensemble-Based Deepfake Detection using State-of-the-Art Models with Robust Cross-Dataset Generalisation [0.0]
  Machine learning-based Deepfake detection models have achieved impressive results on benchmark datasets, but their performance often deteriorates significantly when evaluated on out-of-distribution data. In this work, we investigate an ensemble-based approach for improving the generalization of deepfake detection systems.
  arXiv Detail & Related papers (2025-07-08T13:54:48Z)
- On Large-scale Evaluation of Embedding Models for Knowledge Graph Completion [1.2703808802607108]
  Knowledge graph embedding (KGE) models are extensively studied for knowledge graph completion. Standard evaluation metrics rely on the closed-world assumption, which penalizes models for correctly predicting missing triples. This paper conducts a comprehensive evaluation of four representative KGE models on the large-scale datasets FB-CVT-REV and FB+CVT-REV.
  arXiv Detail & Related papers (2025-04-11T20:49:02Z)
- Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis [36.689210473887904]
  We introduce a benchmarking framework for evaluating cross-dataset prediction generalization in deep learning (DL) and machine learning (ML) models. We quantify both absolute performance (e.g., predictive accuracy across datasets) and relative performance (e.g., the performance drop compared to within-dataset results). Our results reveal substantial performance drops when models are tested on unseen datasets, underscoring the importance of rigorous generalization assessments.
  arXiv Detail & Related papers (2025-03-18T15:40:18Z)
- General Greedy De-bias Learning [163.65789778416172]
  We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model, analogous to gradient descent in functional space. GGD can learn a more robust base model both with task-specific biased models that use prior knowledge and with a self-ensemble biased model that does not.
  arXiv Detail & Related papers (2021-12-20T14:47:32Z)
- Evaluating Predictive Uncertainty and Robustness to Distributional Shift Using Real World Data [0.0]
  We propose metrics for general regression tasks using the Shifts Weather Prediction dataset. We also present an evaluation of the baseline methods using these metrics.
  arXiv Detail & Related papers (2021-11-08T17:32:10Z)
- Test-time Collective Prediction [73.74982509510961]
  Multiple parties in machine learning want to jointly make predictions on future test points. Agents wish to benefit from the collective expertise of the full set of agents, but may not be willing to release their data or model parameters. We explore a decentralized mechanism to make collective predictions at test time, leveraging each agent's pre-trained model.
  arXiv Detail & Related papers (2021-06-22T18:29:58Z)
- How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models [95.8037674226622]
  We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity, and generalization performance of any generative model in a domain-agnostic fashion. Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
  arXiv Detail & Related papers (2021-02-17T18:25:30Z)
- Characterizing Fairness Over the Set of Good Models Under Selective Labels [69.64662540443162]
  We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance. We provide tractable algorithms to compute the range of attainable group-level predictive disparities. We extend our framework to address the empirically relevant challenge of selectively labelled data.
  arXiv Detail & Related papers (2021-01-02T02:11:37Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
  A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples. We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries. We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
  arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.