Predicting Performance for Natural Language Processing Tasks
- URL: http://arxiv.org/abs/2005.00870v1
- Date: Sat, 2 May 2020 16:02:18 GMT
- Title: Predicting Performance for Natural Language Processing Tasks
- Authors: Mengzhou Xia, Antonios Anastasopoulos, Ruochen Xu, Yiming Yang, Graham Neubig
- Abstract summary: We build regression models to predict the evaluation score of an NLP experiment given the experimental settings as input.
Experimenting on 9 different NLP tasks, we find that our predictors can produce meaningful predictions over unseen languages and different modeling architectures.
- Score: 128.34208911925424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the complexity of combinations of tasks, languages, and domains in
natural language processing (NLP) research, it is computationally prohibitive
to exhaustively test newly proposed models on each possible experimental
setting. In this work, we attempt to explore the possibility of gaining
plausible judgments of how well an NLP model can perform under an experimental
setting, without actually training or testing the model. To do so, we build
regression models to predict the evaluation score of an NLP experiment given
the experimental settings as input. Experimenting on 9 different NLP tasks, we
find that our predictors can produce meaningful predictions over unseen
languages and different modeling architectures, outperforming reasonable
baselines as well as human experts. Going further, we outline how our predictor
can be used to find a small subset of representative experiments that should be
run in order to obtain plausible predictions for all other experimental
settings.
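The mechanism described above, a regressor over featurized experimental settings, can be sketched roughly as follows. The feature set, toy data, and choice of gradient-boosted trees are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: fit a regression model that maps featurized experimental
# settings to an observed evaluation score, then query it for settings that
# were never actually run. Features and values below are hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Each row featurizes one completed experiment; the target is its evaluation
# score (e.g., BLEU or F1). Hypothetical features: log10 training-set size,
# a typological distance between source and target language, and a binary
# architecture indicator.
X = np.array([
    [np.log10(10_000),    0.32, 1],
    [np.log10(250_000),   0.10, 1],
    [np.log10(50_000),    0.55, 0],
    [np.log10(1_000_000), 0.05, 0],
    [np.log10(120_000),   0.21, 1],
    [np.log10(30_000),    0.47, 0],
])
y = np.array([12.4, 28.9, 17.1, 34.6, 24.8, 15.3])  # observed scores

predictor = GradientBoostingRegressor(n_estimators=200, max_depth=3)
predictor.fit(X, y)

# Estimate the score of an experiment without training or testing the model.
unseen_setting = np.array([[np.log10(75_000), 0.40, 1]])
print("predicted score:", predictor.predict(unseen_setting)[0])
```

In principle, the same fitted predictor could be queried over a grid of candidate settings, with a small, maximally informative subset chosen for actual experimentation, as the abstract outlines.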
Related papers
- Prediction-Guided Active Experiments [18.494123886098215]
We introduce a new framework for active experimentation, the Prediction-Guided Active Experiment (PGAE)
PGAE leverages predictions from an existing machine learning model to guide sampling and experimentation.
We show that PGAE remains efficient and attains the same semi-parametric bound under certain regularity assumptions.
arXiv Detail & Related papers (2024-11-18T20:16:24Z)
- Evaluating Alternative Training Interventions Using Personalized Computational Models of Learning [0.0]
Evaluating different training interventions to determine which produce the best learning outcomes is one of the main challenges faced by instructional designers.
We present an approach for automatically tuning models to specific individuals and show that personalized models make better predictions of students' behavior than generic ones.
Our approach makes predictions that align with previous human findings, as well as testable predictions that might be evaluated with future human experiments.
arXiv Detail & Related papers (2024-08-24T22:51:57Z)
- Doing Experiments and Revising Rules with Natural Language and Probabilistic Reasoning [6.230721646014307]
We give a model of how to infer natural language rules by doing experiments.
The model integrates Large Language Models (LLMs) with Monte Carlo algorithms for probabilistic inference.
arXiv Detail & Related papers (2024-02-08T19:57:29Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench [52.11481619456093]
We study the performance prediction problem on experiment records from BIG-bench.
An $R^2$ score greater than 95% indicates the presence of learnable patterns within the experiment records.
We find a subset as informative as BIG-bench Hard for evaluating new model families, while being $3\times$ smaller.
arXiv Detail & Related papers (2023-05-24T09:35:34Z)
- Online simulator-based experimental design for cognitive model selection [74.76661199843284]
We propose BOSMOS: an approach to experimental design that can select between computational models without tractable likelihoods.
In simulated experiments, we demonstrate that the proposed BOSMOS technique can accurately select models in up to 2 orders of magnitude less time than existing LFI alternatives.
arXiv Detail & Related papers (2023-03-03T21:41:01Z)
- On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods [20.2027063607352]
We present an experimental study extending a prior explainable ML evaluation experiment and bringing the setup closer to the deployment setting.
Our empirical study draws dramatically different conclusions than the prior work, highlighting how seemingly trivial experimental design choices can yield misleading results.
We believe this work holds lessons about the necessity of situating the evaluation of any ML method and choosing appropriate tasks, data, users, and metrics to match the intended deployment contexts.
arXiv Detail & Related papers (2022-06-24T14:46:19Z)
- Towards More Fine-grained and Reliable NLP Performance Prediction [85.78131503006193]
We make two contributions to improving performance prediction for NLP tasks.
First, we examine performance predictors for holistic measures of accuracy like F1 or BLEU.
Second, we propose methods to understand the reliability of a performance prediction model from two angles: confidence intervals and calibration.
arXiv Detail & Related papers (2021-02-10T15:23:20Z)
- Efficient Adaptive Experimental Design for Average Treatment Effect Estimation [18.027128141189355]
We propose an algorithm for efficient experiments with estimators constructed from dependent samples.
To justify our proposed approach, we provide finite and infinite sample analyses.
arXiv Detail & Related papers (2020-02-13T02:04:17Z)
- Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.