No Need for Learning to Defer? A Training Free Deferral Framework to Multiple Experts through Conformal Prediction
- URL: http://arxiv.org/abs/2509.12573v2
- Date: Mon, 22 Sep 2025 14:32:27 GMT
- Title: No Need for Learning to Defer? A Training Free Deferral Framework to Multiple Experts through Conformal Prediction
- Authors: Tim Bary, Benoît Macq, Louis Petit
- Abstract summary: We propose a training-free, model- and expert-agnostic framework for expert deferral based on conformal prediction. Our method consistently outperforms both the standalone model and the strongest expert.
- Score: 3.746889836344766
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI systems often fail to deliver reliable predictions across all inputs, prompting the need for hybrid human-AI decision-making. Existing Learning to Defer (L2D) approaches address this by training deferral models, but these are sensitive to changes in expert composition and require significant retraining if experts change. We propose a training-free, model- and expert-agnostic framework for expert deferral based on conformal prediction. Our method uses the prediction set generated by a conformal predictor to identify label-specific uncertainty and selects the most discriminative expert using a segregativity criterion, measuring how well an expert distinguishes between the remaining plausible labels. Experiments on CIFAR10-H and ImageNet16-H show that our method consistently outperforms both the standalone model and the strongest expert, with accuracies attaining $99.57\pm0.10\%$ and $99.40\pm0.52\%$, while reducing expert workload by up to a factor of $11$. The method remains robust under degraded expert performance and shows a gradual performance drop in low-information settings. These results suggest a scalable, retraining-free alternative to L2D for real-world human-AI collaboration.
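The deferral pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the nonconformity score is the standard split-conformal choice (one minus the softmax probability of the true label), and the segregativity score shown here (mean diagonal mass of each expert's confusion matrix restricted to the prediction set) is a simplified stand-in for the paper's criterion. The `expert_confusions` input is a hypothetical source of per-expert reliability data.

```python
import numpy as np

def conformal_prediction_set(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction: return the indices of plausible labels."""
    # Nonconformity score: 1 - predicted probability of the true label.
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    # Conformal quantile with finite-sample correction.
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    return np.where(1.0 - test_probs <= q)[0]

def select_expert(pred_set, expert_confusions):
    """Pick the expert best at telling the remaining plausible labels apart.

    expert_confusions: one row-normalized confusion matrix per expert
    (hypothetical reliability data; a simplified segregativity proxy).
    """
    def segregativity(conf):
        # Restrict the confusion matrix to the labels still in play,
        # renormalize rows, and reward mass on the diagonal.
        sub = conf[np.ix_(pred_set, pred_set)]
        sub = sub / sub.sum(axis=1, keepdims=True)
        return np.mean(np.diag(sub))
    return max(range(len(expert_confusions)),
               key=lambda e: segregativity(expert_confusions[e]))
```

When the prediction set contains a single label, the model's answer can be accepted without consulting any expert, which is the source of the workload reduction the abstract reports.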
Related papers
- Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models [108.26461635308796]
We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model's reasoning process and human judgment. Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models. We introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training.
arXiv Detail & Related papers (2026-02-04T15:24:52Z) - Budgeted Multiple-Expert Deferral [38.13580998392063]
Training procedures for deferral algorithms typically require querying all experts for every training instance. We introduce the budgeted deferral framework, which aims to train effective deferral algorithms while minimizing expert query costs during training. We propose new algorithms for both two-stage and single-stage multiple-expert deferral settings that selectively query only a subset of experts per training example.
arXiv Detail & Related papers (2025-10-30T17:08:52Z) - Extracting Uncertainty Estimates from Mixtures of Experts for Semantic Segmentation [9.817102014355617]
We show that well-calibrated predictive uncertainty estimates can be extracted from a mixture of experts (MoE) without architectural modifications. Our results show that MoEs yield more reliable uncertainty estimates than ensembles in terms of conditional correctness metrics. Our experiments on the Cityscapes dataset suggest that increasing the number of experts can further enhance uncertainty calibration.
arXiv Detail & Related papers (2025-09-05T05:30:53Z) - Unified Sparse Mixture of Experts [14.774596844618396]
Sparse Mixture of Experts (SMoEs) models scale the capacity of models while maintaining constant computational overhead. This paper proposes a Unified Sparse Mixture of Experts (USMoE) framework that addresses these limitations.
arXiv Detail & Related papers (2025-03-29T07:15:12Z) - Learning to Defer for Causal Discovery with Imperfect Experts [59.071731337922664]
We propose L2D-CD, a method for gauging the correctness of expert recommendations and optimally combining them with data-driven causal discovery results. We evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its superior performance compared to both the causal discovery method and the expert used in isolation.
arXiv Detail & Related papers (2025-02-18T18:55:53Z) - Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection [63.96018203905272]
We propose to reduce the sampling cost by pruning a pretrained diffusion model into a mixture of efficient experts.
We demonstrate the effectiveness of our method, DiffPruning, across several datasets.
arXiv Detail & Related papers (2024-09-23T21:27:26Z) - Selective Learning: Towards Robust Calibration with Dynamic Regularization [79.92633587914659]
Miscalibration in deep learning refers to a discrepancy between predicted confidence and actual performance.
We introduce Dynamic Regularization (DReg), which aims to learn what should be learned during training, thereby circumventing the confidence-adjustment trade-off.
arXiv Detail & Related papers (2024-02-13T11:25:20Z) - Expert load matters: operating networks at high accuracy and low manual effort [14.978358577277028]
We argue that deep neural networks should be trained by taking into account both accuracy and expert load.
We propose a new complementary loss function for classification that maximizes the area under the COC curve.
Our results demonstrate that the proposed loss improves classification accuracy and delegates fewer decisions to experts.
arXiv Detail & Related papers (2023-08-09T16:08:32Z) - Uncertainty-Driven Action Quality Assessment [11.958132175629368]
We propose a novel probabilistic model, named Uncertainty-Driven AQA (UD-AQA), to capture the diversity among multiple judge scores. We generate an uncertainty estimate for each prediction, which is employed to re-weight the AQA regression loss. Our proposed method achieves competitive results on three benchmarks: the Olympic-event datasets MTL-AQA and FineDiving, and the surgical-skill dataset JIGSAWS.
arXiv Detail & Related papers (2022-07-29T07:21:15Z) - Trustworthy Long-Tailed Classification [41.45744960383575]
We propose a Trustworthy Long-tailed Classification (TLC) method to jointly conduct classification and uncertainty estimation.
Our TLC obtains the evidence-based uncertainty (EvU) and evidence for each expert, and then combines these uncertainties and evidence under Dempster-Shafer Evidence Theory (DST).
The experimental results show that the proposed TLC outperforms the state-of-the-art methods and is trustworthy with reliable uncertainty.
arXiv Detail & Related papers (2021-11-17T10:52:36Z) - Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision [85.07855130048951]
We study a more practical task setting, called test-agnostic long-tailed recognition, where the training class distribution is long-tailed and the test class distribution is unknown.
We propose a new method, called Test-time Aggregating Diverse Experts (TADE), that trains diverse experts to excel at handling different test distributions.
We theoretically show that our method has provable ability to simulate unknown test class distributions.
arXiv Detail & Related papers (2021-07-20T04:10:31Z) - Gaussian Experts Selection using Graphical Models [7.530615321587948]
Local approximations reduce time complexity by dividing the original dataset into subsets and training a local expert on each subset.
We leverage techniques from the literature on undirected graphical models, using sparse precision matrices that encode conditional dependencies between experts to select the most important experts.
arXiv Detail & Related papers (2021-02-02T14:12:11Z) - Leveraging Expert Consistency to Improve Algorithmic Decision Support [62.61153549123407]
We explore the use of historical expert decisions as a rich source of information that can be combined with observed outcomes to narrow the construct gap.
We propose an influence function-based methodology to estimate expert consistency indirectly when each case in the data is assessed by a single expert.
Our empirical evaluation, using simulations in a clinical setting and real-world data from the child welfare domain, indicates that the proposed approach successfully narrows the construct gap.
arXiv Detail & Related papers (2021-01-24T05:40:29Z) - Discriminative Jackknife: Quantifying Uncertainty in Deep Learning via Higher-Order Influence Functions [121.10450359856242]
We develop a frequentist procedure that utilizes influence functions of a model's loss functional to construct a jackknife (or leave-one-out) estimator of predictive confidence intervals.
The DJ satisfies both requirements, is applicable to a wide range of deep learning models, is easy to implement, and can be applied in a post-hoc fashion without interfering with model training or compromising its accuracy.
arXiv Detail & Related papers (2020-06-29T13:36:52Z)