Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations
- URL: http://arxiv.org/abs/2411.00640v1
- Date: Fri, 01 Nov 2024 14:57:16 GMT
- Title: Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations
- Authors: Evan Miller
- Abstract summary: The literature on evaluations has largely ignored the literature from other sciences on experiment analysis and planning.
This article shows researchers with some training in statistics how to think about and analyze data from language model evaluations.
- Abstract: Evaluations are critical for understanding the capabilities of large language models (LLMs). Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the literature from other sciences on experiment analysis and planning. This article shows researchers with some training in statistics how to think about and analyze data from language model evaluations. Conceptualizing evaluation questions as having been drawn from an unseen super-population, we present formulas for analyzing evaluation data, measuring differences between two models, and planning an evaluation experiment. We make a number of specific recommendations for running language model evaluations and reporting experiment results in a way that minimizes statistical noise and maximizes informativeness.
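To make the abstract's first two topics concrete (per-question scores analyzed under the super-population framing, and a paired comparison of two models on shared questions), here is a minimal sketch in Python. The function names and the simulated scores are illustrative assumptions, not code or data from the paper; binary 0/1 correctness scores are assumed for the toy example.

```python
import numpy as np

def mean_and_sem(scores):
    """Mean eval score and its standard error, treating the n questions
    as an i.i.d. draw from an unseen super-population of questions."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    sem = scores.std(ddof=1) / np.sqrt(scores.size)  # sample SD / sqrt(n)
    return mean, sem

def paired_difference(scores_a, scores_b):
    """Compare two models graded on the SAME questions via per-question
    score differences; shared question difficulty cancels out, which
    usually shrinks the standard error of the difference."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    return mean_and_sem(diffs)

# Toy example: binary (0/1) correctness on 1,000 shared questions.
rng = np.random.default_rng(0)
p = rng.uniform(0.2, 0.9, size=1000)            # per-question pass rates
model_a = rng.binomial(1, np.clip(p + 0.03, 0.0, 1.0))
model_b = rng.binomial(1, p)

mean_a, sem_a = mean_and_sem(model_a)
diff, sem_d = paired_difference(model_a, model_b)
print(f"Model A: {mean_a:.3f} +/- {1.96 * sem_a:.3f} (95% CI)")
print(f"A - B:   {diff:+.3f} +/- {1.96 * sem_d:.3f} (95% CI)")
```

Reporting the mean together with a ±1.96 × SEM interval gives the error bar of the title; the paired difference typically carries a much tighter interval than comparing two independently computed means, because per-question difficulty cancels out. (A sample-size sketch for the third topic, experiment planning, follows the related-papers list below.)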
Related papers
- NLP and Education: using semantic similarity to evaluate filled gaps in a large-scale Cloze test in the classroom
Using data from Cloze tests administered to students in Brazil, word embedding (WE) models for Brazilian Portuguese (PT-BR) were employed to measure semantic similarity.
A comparative analysis between the WE models' scores and the judges' evaluations revealed that GloVe was the most effective model.
arXiv Detail & Related papers (2024-11-02T15:22:26Z)
- ElicitationGPT: Text Elicitation Mechanisms via Language Models
This paper develops mechanisms for scoring elicited text against ground truth text using domain-knowledge-free queries to a large language model.
An empirical evaluation is conducted on peer reviews from a peer-grading dataset and in comparison to manual instructor scores for the peer reviews.
arXiv Detail & Related papers (2024-06-13T17:49:10Z)
- Is Data Valuation Learnable and Interpretable?
Current data valuation methods ignore the interpretability of the output values.
This study aims to answer an important question: is data valuation learnable and interpretable?
arXiv Detail & Related papers (2024-06-03T08:13:47Z)
- Lessons from the Trenches on Reproducible Evaluation of Language Models
We draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers.
We present the Language Model Evaluation Harness (lm-eval), an open source library for independent, reproducible, and extensible evaluation of language models.
arXiv Detail & Related papers (2024-05-23T16:50:49Z)
- Computational Models to Study Language Processing in the Human Brain: A Survey
This paper reviews efforts in using computational models for brain research, highlighting emerging trends.
Our analysis reveals that no single model outperforms others on all datasets.
arXiv Detail & Related papers (2024-03-20T08:01:22Z)
- F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods
We propose F-Eval, a bilingual evaluation benchmark for assessing fundamental abilities, including expression, commonsense, and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that fine-grained evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- Multi-Dimensional Evaluation of Text Summarization with In-Context Learning
In this paper, we study the efficacy of large language models as multi-dimensional evaluators using in-context learning.
Our experiments show that in-context learning-based evaluators are competitive with learned evaluation frameworks for the task of text summarization.
We then analyze the effects of factors such as the selection and number of in-context examples on performance.
arXiv Detail & Related papers (2023-06-01T23:27:49Z)
- Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation
This survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation.
First, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization.
Second, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models.
Third, we provide an overview of the nascent field of AI feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for human intervention.
arXiv Detail & Related papers (2023-05-01T17:36:06Z)
- How to Evaluate a Summarizer: Study Design and Statistical Analysis for Manual Linguistic Quality Evaluation
We show that the best choice of evaluation method can vary from one aspect to another.
We show that the total number of annotators can have a strong impact on study power.
Current statistical analysis methods can inflate type I error rates up to eight-fold.
arXiv Detail & Related papers (2021-01-27T10:14:15Z)
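The study-power point in the last entry dovetails with the main abstract's third topic, planning an evaluation experiment. As a rough planning aid, a textbook two-sided normal-approximation power formula gives the number of questions needed to detect a given score difference. The sketch below assumes a paired comparison; the name questions_needed and the example numbers are illustrative, not taken from either paper.

```python
import math
from scipy.stats import norm

def questions_needed(delta, sd, alpha=0.05, power=0.80):
    """Questions required for a paired two-model comparison to detect a
    true mean score difference `delta`, where `sd` is the standard
    deviation of per-question score differences, via the two-sided
    normal approximation n = ((z_{1-alpha/2} + z_power) * sd / delta)^2."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil((z * sd / delta) ** 2)

# E.g., detecting a 2-point accuracy gap (delta = 0.02) when paired 0/1
# score differences have a standard deviation of about 0.5:
print(questions_needed(delta=0.02, sd=0.5))  # -> 4906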
```

Halving the detectable difference quadruples the required number of questions, which is one reason small eval deltas reported without error bars are hard to interpret.