Predicting Performance for Natural Language Processing Tasks
- URL: http://arxiv.org/abs/2005.00870v1
- Date: Sat, 2 May 2020 16:02:18 GMT
- Title: Predicting Performance for Natural Language Processing Tasks
- Authors: Mengzhou Xia, Antonios Anastasopoulos, Ruochen Xu, Yiming Yang, Graham Neubig
- Abstract summary: We build regression models to predict the evaluation score of an NLP experiment given the experimental settings as input.
Experimenting on 9 different NLP tasks, we find that our predictors can produce meaningful predictions over unseen languages and different modeling architectures.
- Score: 128.34208911925424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the complexity of combinations of tasks, languages, and domains in
natural language processing (NLP) research, it is computationally prohibitive
to exhaustively test newly proposed models on each possible experimental
setting. In this work, we attempt to explore the possibility of gaining
plausible judgments of how well an NLP model can perform under an experimental
setting, without actually training or testing the model. To do so, we build
regression models to predict the evaluation score of an NLP experiment given
the experimental settings as input. Experimenting on 9 different NLP tasks, we
find that our predictors can produce meaningful predictions over unseen
languages and different modeling architectures, outperforming reasonable
baselines as well as human experts. Going further, we outline how our predictor
can be used to find a small subset of representative experiments that should be
run in order to obtain plausible predictions for all other experimental
settings.
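The mechanism described above, a regressor over featurized experimental settings, can be sketched roughly as follows. The feature set, toy data, and choice of gradient-boosted trees are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: fit a regression model that maps featurized experimental
# settings to an observed evaluation score, then query it for settings that
# were never actually run. Features and values below are hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Each row featurizes one completed experiment; the target is its evaluation
# score (e.g., BLEU or F1). Hypothetical features: log10 training-set size,
# a typological distance between source and target language, and a binary
# architecture indicator.
X = np.array([
    [np.log10(10_000),    0.32, 1],
    [np.log10(250_000),   0.10, 1],
    [np.log10(50_000),    0.55, 0],
    [np.log10(1_000_000), 0.05, 0],
    [np.log10(120_000),   0.21, 1],
    [np.log10(30_000),    0.47, 0],
])
y = np.array([12.4, 28.9, 17.1, 34.6, 24.8, 15.3])  # observed scores

predictor = GradientBoostingRegressor(n_estimators=200, max_depth=3)
predictor.fit(X, y)

# Estimate the score of an experiment without training or testing the model.
unseen_setting = np.array([[np.log10(75_000), 0.40, 1]])
print("predicted score:", predictor.predict(unseen_setting)[0])
```

In principle, the same fitted predictor could be queried over a grid of candidate settings, with a small, maximally informative subset chosen for actual experimentation, as the abstract outlines.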
Related papers
- Prediction-Guided Active Experiments [18.494123886098215]
We introduce a new framework for active experimentation, the Prediction-Guided Active Experiment (PGAE)
PGAE leverages predictions from an existing machine learning model to guide sampling and experimentation.
We show that PGAE remains efficient and attains the same semi-parametric bound under certain regularity assumptions.
arXiv Detail & Related papers (2024-11-18T20:16:24Z)
- Evaluating Alternative Training Interventions Using Personalized Computational Models of Learning [0.0]
Evaluating different training interventions to determine which produce the best learning outcomes is one of the main challenges faced by instructional designers.
We present an approach for automatically tuning models to specific individuals and show that personalized models make better predictions of students' behavior than generic ones.
Our approach makes predictions that align with previous human findings, as well as testable predictions that might be evaluated with future human experiments.
arXiv Detail & Related papers (2024-08-24T22:51:57Z)
- Doing Experiments and Revising Rules with Natural Language and Probabilistic Reasoning [6.230721646014307]
We give a model of how to infer natural language rules by doing experiments.
The model integrates Large Language Models (LLMs) with Monte Carlo algorithms for probabilistic inference.
arXiv Detail & Related papers (2024-02-08T19:57:29Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench [52.11481619456093]
We study the performance prediction problem on experiment records from BIG-bench.
An $R^2$ score greater than 95% indicates the presence of learnable patterns within the experiment records.
We find a subset as informative as BIG-bench Hard for evaluating new model families, while being $3\times$ smaller.
arXiv Detail & Related papers (2023-05-24T09:35:34Z)
- Online simulator-based experimental design for cognitive model selection [74.76661199843284]
We propose BOSMOS: an approach to experimental design that can select between computational models without tractable likelihoods.
In simulated experiments, we demonstrate that the proposed BOSMOS technique can accurately select models in up to 2 orders of magnitude less time than existing LFI alternatives.
arXiv Detail & Related papers (2023-03-03T21:41:01Z)
- On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods [20.2027063607352]
We present an experimental study extending a prior explainable ML evaluation experiment and bringing the setup closer to the deployment setting.
Our empirical study draws dramatically different conclusions than the prior work, highlighting how seemingly trivial experimental design choices can yield misleading results.
We believe this work holds lessons about the necessity of situating the evaluation of any ML method and choosing appropriate tasks, data, users, and metrics to match the intended deployment contexts.
arXiv Detail & Related papers (2022-06-24T14:46:19Z)
- Towards More Fine-grained and Reliable NLP Performance Prediction [85.78131503006193]
We make two contributions to improving performance prediction for NLP tasks.
First, we examine performance predictors for holistic measures of accuracy like F1 or BLEU.
Second, we propose methods to understand the reliability of a performance prediction model from two angles: confidence intervals and calibration.
arXiv Detail & Related papers (2021-02-10T15:23:20Z)
- Efficient Adaptive Experimental Design for Average Treatment Effect Estimation [18.027128141189355]
We propose an algorithm for efficient experiments with estimators constructed from dependent samples.
To justify our proposed approach, we provide finite and infinite sample analyses.
arXiv Detail & Related papers (2020-02-13T02:04:17Z)
- Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.