Towards More Fine-grained and Reliable NLP Performance Prediction
- URL: http://arxiv.org/abs/2102.05486v1
- Date: Wed, 10 Feb 2021 15:23:20 GMT
- Title: Towards More Fine-grained and Reliable NLP Performance Prediction
- Authors: Zihuiwen Ye, Pengfei Liu, Jinlan Fu, Graham Neubig
- Abstract summary: We make two contributions to improving performance prediction for NLP tasks.
First, we examine performance predictors not only for holistic measures of accuracy like F1 or BLEU but also for fine-grained measures such as accuracy over individual classes of examples.
Second, we propose methods to understand the reliability of a performance prediction model from two angles: confidence intervals and calibration.
- Score: 85.78131503006193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Performance prediction, the task of estimating a system's performance without
performing experiments, allows us to reduce the experimental burden caused by
the combinatorial explosion of different datasets, languages, tasks, and
models. In this paper, we make two contributions to improving performance
prediction for NLP tasks. First, we examine performance predictors not only for
holistic measures of accuracy like F1 or BLEU but also fine-grained performance
measures such as accuracy over individual classes of examples. Second, we
propose methods to understand the reliability of a performance prediction model
from two angles: confidence intervals and calibration. We perform an analysis
of four types of NLP tasks, demonstrating both the feasibility of fine-grained
performance prediction and the necessity of reliability
analysis for performance prediction methods in the future. We make our code
publicly available: \url{https://github.com/neulab/Reliable-NLPPP}
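To make the setup concrete, here is a minimal sketch of the kind of pipeline the abstract describes: a regressor maps features of an experimental setting (dataset, language, model) to an observed metric such as F1, and a bootstrap over the training experiments attaches a confidence interval to each prediction. The featurization and the choice of gradient boosting are illustrative assumptions, not the paper's exact model.

```python
# Illustrative performance predictor with a bootstrap confidence interval.
# Feature names and the regressor are assumptions, not the paper's method.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical featurization: one row per (dataset, model) experiment,
# e.g., [log train size, avg sentence length, label entropy, ...].
X_train = rng.random((200, 5))
y_train = rng.random(200)          # observed F1 scores in [0, 1]
X_new = rng.random((1, 5))         # an unseen experimental setting

def bootstrap_interval(X, y, x_query, n_boot=200, alpha=0.05):
    """Bootstrap the training experiments to get a CI on the prediction."""
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))
        model = GradientBoostingRegressor().fit(X[idx], y[idx])
        preds.append(model.predict(x_query)[0])
    lo, hi = np.quantile(preds, [alpha / 2, 1 - alpha / 2])
    return float(np.mean(preds)), (float(lo), float(hi))

point, (low, high) = bootstrap_interval(X_train, y_train, X_new)
print(f"predicted F1 ~ {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Checking whether such intervals actually contain the true score at their nominal rate is the calibration question the paper raises.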
Related papers
- Can We Predict Performance of Large Models across Vision-Language Tasks? [34.27319941609499]
We propose a new framework for predicting unknown performance scores based on observed ones from other LVLMs or tasks.
We use a sparse performance matrix $\boldsymbol{R}$, where each entry $R_{mn}$ represents the performance score of the $m$-th model on the $n$-th dataset.
We demonstrate the accuracy of PMF in predicting unknown scores, the reliability of uncertainty estimates in ordering evaluations, and the effectiveness of our enhancements for handling sparse data.
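The sparse-matrix formulation lends itself to a simple matrix-factorization baseline. The sketch below fills in missing entries of a score matrix by alternating least squares; it is an accessible stand-in for, not a reproduction of, the probabilistic matrix factorization (PMF) model the paper proposes, which also yields uncertainty estimates.

```python
# Illustrative matrix factorization for a sparse performance matrix R
# (models x datasets); the paper's PMF adds priors and uncertainty.
import numpy as np

rng = np.random.default_rng(0)
M, N, K, lam = 8, 12, 3, 0.1            # models, datasets, latent dim, L2

R = rng.random((M, N))                  # ground-truth scores (for demo)
observed = rng.random((M, N)) < 0.4     # mask of evaluated (m, n) pairs

U = rng.normal(scale=0.1, size=(M, K))  # model embeddings
V = rng.normal(scale=0.1, size=(N, K))  # dataset embeddings

for _ in range(50):                     # alternating least squares
    for m in range(M):
        idx = observed[m]
        A = V[idx].T @ V[idx] + lam * np.eye(K)
        U[m] = np.linalg.solve(A, V[idx].T @ R[m, idx])
    for n in range(N):
        idx = observed[:, n]
        A = U[idx].T @ U[idx] + lam * np.eye(K)
        V[n] = np.linalg.solve(A, U[idx].T @ R[idx, n])

R_hat = U @ V.T                         # predicted scores for unseen pairs
print(np.abs(R_hat - R)[~observed].mean())
```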
arXiv Detail & Related papers (2024-10-14T03:00:12Z)
- Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more efficient metric for performance estimation.
We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources.
We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific losses and downstream performance.
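A compact sketch of this two-stage recipe: (1) fit a power law mapping compute (FLOPs) to a domain-specific pre-training loss, (2) fit a small neural network mapping several domain losses to a downstream score. The functional form, data, and layer sizes here are assumptions for illustration only.

```python
# Two-stage sketch: power law (FLOPs -> domain loss), then a small NN
# (domain losses -> downstream score). Illustrative, not the paper's code.
import numpy as np
from scipy.optimize import curve_fit
from sklearn.neural_network import MLPRegressor

def power_law(flops, a, b, c):
    return a * flops ** (-b) + c        # L(C) = a * C^(-b) + c

flops = np.logspace(18, 22, 10)
loss = power_law(flops, 2.0e3, 0.12, 1.8) + np.random.normal(0, 0.01, 10)
params, _ = curve_fit(power_law, flops, loss, p0=[1e3, 0.1, 1.0], maxfev=10000)
predicted_loss = power_law(1e23, *params)   # extrapolate to a larger budget

# Stage 2: map a vector of domain losses to a downstream benchmark score.
domain_losses = np.random.random((50, 4))   # e.g., web, code, books, math
downstream = np.random.random(50)           # observed benchmark scores
mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000).fit(
    domain_losses, downstream)
print(predicted_loss, mlp.predict(domain_losses[:1]))
```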
arXiv Detail & Related papers (2024-10-11T04:57:48Z)
- Forecast-PEFT: Parameter-Efficient Fine-Tuning for Pre-trained Motion Forecasting Models [68.23649978697027]
Forecast-PEFT is a fine-tuning strategy that freezes the majority of the model's parameters, focusing adjustments on newly introduced prompts and adapters.
Our experiments show that Forecast-PEFT outperforms traditional full fine-tuning methods in motion prediction tasks.
Forecast-FT further improves prediction performance, achieving up to a 9.6% improvement over conventional baseline methods.
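The core mechanic, freezing a pre-trained backbone and training only newly added prompts and adapters, can be sketched in a few lines of PyTorch. The backbone, adapter shape, and prompt here are placeholders, not the actual Forecast-PEFT implementation.

```python
# Freeze-and-adapt sketch: a frozen backbone plus a trainable adapter and
# learned prompt. Module names and sizes are illustrative placeholders.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter applied after the frozen backbone."""
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)
    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

backbone = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
for p in backbone.parameters():
    p.requires_grad = False                             # freeze the backbone

adapter = Adapter(64)
prompt = nn.Parameter(torch.zeros(1, 64))               # learned prompt token

trainable = list(adapter.parameters()) + [prompt]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

x = torch.randn(8, 64)
out = adapter(backbone(x + prompt))     # only adapter/prompt receive gradients
loss = out.pow(2).mean()                # placeholder loss for the sketch
loss.backward()
optimizer.step()
print(sum(p.numel() for p in trainable), "trainable parameters")
```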
arXiv Detail & Related papers (2024-07-28T19:18:59Z)
- Uncertainty-Aware Performance Prediction for Highly Configurable Software Systems via Bayesian Neural Networks [12.607426130997336]
We propose a Bayesian deep learning based method, namely BDLPerf, that can incorporate uncertainty into the prediction model.
We develop a novel uncertainty calibration technique to ensure the reliability of the confidence intervals generated by a Bayesian prediction model.
Our experimental results on 10 real-world systems show that BDLPerf achieves higher accuracy than existing approaches.
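To illustrate the uncertainty-aware prediction idea, the sketch below uses Monte Carlo dropout as a simple stand-in for the paper's Bayesian deep learning model and empirically checks the coverage of the resulting 95% intervals.

```python
# MC-dropout stand-in for a Bayesian performance predictor, plus a simple
# empirical coverage check. Not the BDLPerf model itself.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.2), nn.Linear(64, 1))

X = torch.randn(128, 10)             # configuration options of a system
y = X[:, :3].sum(dim=1, keepdim=True) + 0.1 * torch.randn(128, 1)

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(500):                 # standard regression training
    opt.zero_grad()
    nn.functional.mse_loss(model(X), y).backward()
    opt.step()

model.train()                        # keep dropout active at test time
samples = torch.stack([model(X) for _ in range(100)])   # (100, 128, 1)
lo = samples.quantile(0.025, dim=0)
hi = samples.quantile(0.975, dim=0)
coverage = ((y >= lo) & (y <= hi)).float().mean()
print(f"empirical coverage of the nominal 95% interval: {coverage:.2f}")
```

Intervals produced this way can under- or over-cover, which is exactly the gap a dedicated uncertainty calibration step targets.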
arXiv Detail & Related papers (2022-12-27T04:39:26Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting target accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
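Following that description, a minimal numpy sketch of ATC: choose a confidence threshold on labeled source data so that the fraction of examples above it matches source accuracy, then estimate target accuracy as the fraction of unlabeled target examples above that threshold. The synthetic confidences are illustrative.

```python
# Sketch of Average Thresholded Confidence (ATC) on synthetic confidences.
import numpy as np

def atc_threshold(source_conf, source_correct):
    """Choose t so that mean(conf > t) ~= source accuracy."""
    acc = source_correct.mean()
    # the (1 - acc) quantile of source confidences leaves ~acc mass above it
    return np.quantile(source_conf, 1.0 - acc)

def atc_predict_accuracy(target_conf, threshold):
    return (target_conf > threshold).mean()

rng = np.random.default_rng(0)
source_conf = rng.beta(5, 2, size=1000)            # max softmax scores (source)
source_correct = rng.random(1000) < source_conf    # correctness on source
target_conf = rng.beta(4, 3, size=1000)            # unlabeled target confidences

t = atc_threshold(source_conf, source_correct)
print("estimated target accuracy:", atc_predict_accuracy(target_conf, t))
```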
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Probabilistic Gradient Boosting Machines for Large-Scale Probabilistic Regression [51.770998056563094]
Probabilistic Gradient Boosting Machines (PGBM) is a method to create probabilistic predictions with a single ensemble of decision trees.
We empirically demonstrate the advantages of PGBM compared to existing state-of-the-art methods.
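PGBM itself fits a single tree ensemble whose leaves carry distributional parameters; as a more accessible stand-in for the same goal (probabilistic predictions from gradient-boosted trees), the sketch below uses plain scikit-learn quantile regression. It illustrates the objective, not PGBM's method.

```python
# Prediction intervals from gradient boosting via quantile loss
# (scikit-learn), as a stand-in for PGBM's single probabilistic ensemble.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(0, 0.2, 500)

models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
    for q in (0.05, 0.5, 0.95)
}
x_new = rng.random((1, 4))
lo, med, hi = (models[q].predict(x_new)[0] for q in (0.05, 0.5, 0.95))
print(f"median {med:.2f}, 90% interval [{lo:.2f}, {hi:.2f}]")
```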
arXiv Detail & Related papers (2021-06-03T08:32:13Z)
- Learning Prediction Intervals for Model Performance [1.433758865948252]
We propose a method to compute prediction intervals for model performance.
We evaluate our approach across a wide range of drift conditions and show substantial improvement over competitive baselines.
arXiv Detail & Related papers (2020-12-15T21:32:03Z)
- Towards Improving Selective Prediction Ability of NLP Systems [24.774450633678125]
We propose a method that improves probability estimates of models by calibrating them using prediction confidence and difficulty score of instances.
We instantiate our method with Natural Language Inference (NLI) and Duplicate Detection (DD) tasks and evaluate it in both In-Domain (IID) and Out-of-Domain (OOD) settings.
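The calibration angle shared by this entry and the main paper is usually quantified with expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence to its empirical accuracy. Below is a generic ECE computation, not the confidence/difficulty-based calibrator the paper proposes.

```python
# Generic expected calibration error (ECE) on synthetic predictions.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap      # weight each bin by its frequency
    return ece

rng = np.random.default_rng(0)
conf = rng.beta(5, 2, size=2000)                 # predicted confidences
correct = rng.random(2000) < conf ** 1.5         # an over-confident model
print("ECE:", expected_calibration_error(conf, correct.astype(float)))
```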
arXiv Detail & Related papers (2020-08-21T08:46:36Z)
- Robust Validation: Confident Predictions Even When Distributions Shift [19.327409270934474]
We describe procedures for robust predictive inference, where a model provides uncertainty estimates on its predictions rather than point predictions.
We present a method that produces prediction sets (almost exactly) giving the right coverage level for any test distribution in an $f$-divergence ball around the training population.
An essential component of our methodology is to estimate the amount of expected future data shift and build robustness to it.
arXiv Detail & Related papers (2020-08-10T17:09:16Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
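The transductive refinement step can be sketched as re-estimating each class prototype as a confidence-weighted mean of the unlabeled query embeddings, where confidence here is simply a softmax over negative distances; the paper's contribution is to meta-learn this weighting rather than fix it, which the sketch omits.

```python
# Confidence-weighted prototype refinement with fixed softmax weights;
# the paper meta-learns the weighting instead of fixing it like this.
import numpy as np

def refine_prototypes(prototypes, queries, temperature=1.0, steps=3):
    for _ in range(steps):
        # distance of every query to every prototype -> soft assignments
        d = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d / temperature)
        w /= w.sum(axis=1, keepdims=True)            # confidence per class
        # confidence-weighted mean of queries updates each prototype
        prototypes = (w.T @ queries) / w.sum(axis=0)[:, None]
    return prototypes

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(5, 16))    # 5-way episode, 16-dim embeddings
queries = rng.normal(size=(75, 16))      # 15 unlabeled queries per class
print(refine_prototypes(prototypes, queries).shape)   # (5, 16)
```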
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.