Show Your Work with Confidence: Confidence Bands for Tuning Curves
- URL: http://arxiv.org/abs/2311.09480v2
- Date: Mon, 8 Apr 2024 18:21:53 GMT
- Title: Show Your Work with Confidence: Confidence Bands for Tuning Curves
- Authors: Nicholas Lourie, Kyunghyun Cho, He He
- Abstract summary: Tuning curves plot validation performance as a function of tuning effort.
We present the first method to construct valid confidence bands for tuning curves.
We validate our design with ablations, analyze the effect of sample size, and provide guidance on comparing models with our method.
- Score: 51.12106543561089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The choice of hyperparameters greatly impacts performance in natural language processing. Often, it is hard to tell if a method is better than another or just better tuned. Tuning curves fix this ambiguity by accounting for tuning effort. Specifically, they plot validation performance as a function of the number of hyperparameter choices tried so far. While several estimators exist for these curves, it is common to use point estimates, which we show fail silently and give contradictory results when given too little data. Beyond point estimates, confidence bands are necessary to rigorously establish the relationship between different approaches. We present the first method to construct valid confidence bands for tuning curves. The bands are exact, simultaneous, and distribution-free, thus they provide a robust basis for comparing methods. Empirical analysis shows that while bootstrap confidence bands, which serve as a baseline, fail to approximate their target confidence, ours achieve it exactly. We validate our design with ablations, analyze the effect of sample size, and provide guidance on comparing models with our method. To promote confident comparisons in future work, we release opda: an easy-to-use library that you can install with pip. https://github.com/nicholaslourie/opda
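To make the objects in the abstract concrete, here is a minimal sketch (not the paper's method, and not its `opda` library) of a tuning curve with a pointwise percentile-bootstrap band around it. Per the abstract, bootstrap bands like this are only a baseline and fail to reach their target confidence; the paper's bands are exact and simultaneous. All function names and parameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def tuning_curve(scores, ks, rng, n_resamples=200):
    """Median-of-max tuning curve: for each budget k, the typical best
    validation score seen after k random hyperparameter trials."""
    scores = np.asarray(scores)
    curve = []
    for k in ks:
        # Simulate k-trial tuning runs by drawing k scores with replacement.
        draws = rng.choice(scores, size=(n_resamples, k), replace=True)
        curve.append(np.median(draws.max(axis=1)))
    return np.array(curve)

def bootstrap_band(scores, ks, rng, n_boot=500, alpha=0.2):
    """Pointwise percentile-bootstrap band around the tuning curve:
    resample the observed scores, recompute the curve, take quantiles."""
    curves = np.stack([
        tuning_curve(rng.choice(scores, size=len(scores), replace=True),
                     ks, rng)
        for _ in range(n_boot)
    ])
    lo = np.quantile(curves, alpha / 2, axis=0)
    hi = np.quantile(curves, 1 - alpha / 2, axis=0)
    return lo, hi
```

For a valid band with the guarantees the paper describes, one would use `opda` rather than this bootstrap baseline.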
Related papers
- Pearls from Pebbles: Improved Confidence Functions for Auto-labeling [51.44986105969375]
Threshold-based auto-labeling (TBAL) works by finding a threshold on a model's confidence scores above which it can accurately label unlabeled data points.
We propose a framework for studying the optimal TBAL confidence function.
We develop a new post-hoc method specifically designed to maximize performance in TBAL systems.
arXiv Detail & Related papers (2024-04-24T20:22:48Z)
- Binary Classification with Confidence Difference [100.08818204756093]
This paper delves into a novel weakly supervised binary classification problem called confidence-difference (ConfDiff) classification.
We propose a risk-consistent approach to tackle this problem and show that the estimation error bound achieves the optimal convergence rate.
We also introduce a risk correction approach to mitigate overfitting problems, whose consistency and convergence rate are also proven.
arXiv Detail & Related papers (2023-10-09T11:44:50Z)
- Sample and Predict Your Latent: Modality-free Sequential Disentanglement via Contrastive Estimation [2.7759072740347017]
We introduce a self-supervised sequential disentanglement framework based on contrastive estimation with no external signals.
In practice, we propose a unified, efficient, and easy-to-code sampling strategy for semantically similar and dissimilar views of the data.
Our method presents state-of-the-art results in comparison to existing techniques.
arXiv Detail & Related papers (2023-05-25T10:50:30Z)
- Sequential Kernelized Independence Testing [101.22966794822084]
We design sequential kernelized independence tests inspired by kernelized dependence measures.
We demonstrate the power of our approaches on both simulated and real data.
arXiv Detail & Related papers (2022-12-14T18:08:42Z)
- Catoni-style Confidence Sequences under Infinite Variance [19.61346221428679]
We provide an extension of confidence sequences for settings where the variance of the data-generating distribution does not exist or is infinite.
Confidence sequences furnish confidence intervals that are valid at arbitrary data-dependent stopping times.
The derived results are shown to improve upon confidence sequences obtained using the Dubins-Savage inequality.
arXiv Detail & Related papers (2022-08-05T14:11:06Z)
- Comparing Sequential Forecasters [35.38264087676121]
Consider two forecasters, each making a single prediction for a sequence of events over time.
How might we compare these forecasters, either online or post-hoc, while avoiding unverifiable assumptions on how the forecasts and outcomes were generated?
We present novel sequential inference procedures for estimating the time-varying difference in forecast scores.
We empirically validate our approaches by comparing real-world baseball and weather forecasters.
arXiv Detail & Related papers (2021-09-30T22:54:46Z)
- Parametric Bootstrap for Differentially Private Confidence Intervals [8.781431682774484]
We develop a practical and general-purpose approach to construct confidence intervals for differentially private parametric estimation.
We find that the parametric bootstrap is a simple and effective solution.
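As context for this entry, a minimal non-private sketch of a percentile parametric-bootstrap confidence interval for a Gaussian mean: fit the model, simulate datasets from the fit, re-estimate, and take quantiles. The differential-privacy machinery that is the paper's actual contribution is not reproduced here; names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def parametric_bootstrap_ci(data, n_boot=2000, alpha=0.05, rng=rng):
    """Percentile parametric-bootstrap CI for the mean of a Gaussian model."""
    mu, sigma = data.mean(), data.std(ddof=1)  # fit the parametric model
    n = len(data)
    # Simulate n_boot datasets from the fitted model and re-estimate the mean.
    boot = rng.normal(mu, sigma, size=(n_boot, n)).mean(axis=1)
    return np.quantile(boot, alpha / 2), np.quantile(boot, 1 - alpha / 2)
```

In the private setting, `mu` and `sigma` would be replaced by differentially private estimates and the simulation would mimic the privatization mechanism as well.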
arXiv Detail & Related papers (2020-06-14T00:08:19Z)
- Showing Your Work Doesn't Always Work [73.63200097493576]
"Show Your Work: Improved Reporting of Experimental Results" advocates for reporting the expected validation effectiveness of the best-tuned model.
We analytically show that their estimator is biased and uses error-prone assumptions.
We derive an unbiased alternative and bolster our claims with empirical evidence from statistical simulation.
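The biased/unbiased distinction in this entry can be made concrete. Assuming the standard formulation of the problem (estimate the expected maximum validation score after k hyperparameter trials from n observed scores), the sketch below contrasts the plug-in estimator, which treats the empirical distribution as the truth, with a closed-form unbiased alternative that averages the maximum over all size-k subsets of the observed scores. Function names are illustrative, not from either paper's code.

```python
from math import comb

import numpy as np

def expected_max_plugin(scores, k):
    """Plug-in estimator of E[max of k trials]: weights each sorted score
    by the probability it is the maximum under the empirical CDF."""
    v = np.sort(scores)
    n = len(v)
    i = np.arange(1, n + 1)
    weights = (i / n) ** k - ((i - 1) / n) ** k
    return float(weights @ v)

def expected_max_unbiased(scores, k):
    """Unbiased estimator: the average maximum over all C(n, k) subsets
    of size k, computed in closed form via binomial coefficients."""
    v = np.sort(scores)
    n = len(v)
    # comb(i - 1, k - 1) counts subsets whose maximum is the i-th order
    # statistic; it is zero for i < k.
    weights = np.array([comb(i - 1, k - 1) / comb(n, k)
                        for i in range(1, n + 1)])
    return float(weights @ v)
```

Sanity checks: for k = 1 both reduce to the sample mean, and for k = n the unbiased estimator returns the sample maximum exactly.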
arXiv Detail & Related papers (2020-04-28T17:59:01Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.