Measuring Massive Multitask Language Understanding
- URL: http://arxiv.org/abs/2009.03300v3
- Date: Tue, 12 Jan 2021 18:57:11 GMT
- Title: Measuring Massive Multitask Language Understanding
- Authors: Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika,
Dawn Song, Jacob Steinhardt
- Abstract summary: The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
The largest GPT-3 model improves over random chance by almost 20 percentage points on average.
Models also have lopsided performance and frequently do not know when they are wrong.
- Score: 79.6985576698597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new test to measure a text model's multitask accuracy. The test
covers 57 tasks including elementary mathematics, US history, computer science,
law, and more. To attain high accuracy on this test, models must possess
extensive world knowledge and problem solving ability. We find that while most
recent models have near random-chance accuracy, the very largest GPT-3 model
improves over random chance by almost 20 percentage points on average. However,
on every one of the 57 tasks, the best models still need substantial
improvements before they can reach expert-level accuracy. Models also have
lopsided performance and frequently do not know when they are wrong. Worse,
they still have near-random accuracy on some socially important subjects such
as morality and law. By comprehensively evaluating the breadth and depth of a
model's academic and professional understanding, our test can be used to
analyze models across many tasks and to identify important shortcomings.
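To make the protocol concrete, here is a minimal Python sketch of this style of multiple-choice scoring. It is an illustration under assumptions, not the paper's harness: the `model_scores` callable stands in for a real language model, and the example field names (`subject`, `question`, `options`, `answer`) are hypothetical, though the prompt header follows the question/answer format the paper describes.

```python
# Minimal sketch of MMLU-style multiple-choice accuracy scoring.

CHOICES = ["A", "B", "C", "D"]

def format_question(subject, question, options):
    """Build a zero-shot prompt in the paper's question/answer layout."""
    header = (
        f"The following are multiple choice questions (with answers) "
        f"about {subject}.\n\n"
    )
    body = question + "\n"
    for letter, option in zip(CHOICES, options):
        body += f"{letter}. {option}\n"
    return header + body + "Answer:"

def predict(model_scores, prompt):
    """Return the letter the model scores highest as the continuation.

    model_scores(prompt, letter) stands in for a language model's
    log-likelihood of `letter` following `prompt` (an assumption here).
    """
    return max(CHOICES, key=lambda letter: model_scores(prompt, letter))

def accuracy(model_scores, dataset):
    """Fraction of questions answered correctly; with four options,
    random chance sits at 0.25."""
    correct = sum(
        predict(model_scores,
                format_question(ex["subject"], ex["question"], ex["options"]))
        == ex["answer"]
        for ex in dataset
    )
    return correct / len(dataset)
```

Because every question has four options, the 0.25 random-chance floor is the baseline against which the roughly 20-point GPT-3 improvement in the abstract is measured.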
Related papers
- Lawma: The Power of Specialization for Legal Tasks [18.45967769381101]
We study 260 legal text classification tasks, nearly all new to the machine learning community.
A lightly fine-tuned Llama 3 model vastly outperforms GPT-4 on almost all tasks, typically by double-digit percentage points.
We find that larger models respond better to fine-tuning than smaller models.
arXiv Detail & Related papers (2024-07-23T16:23:04Z)
- Changing Answer Order Can Decrease MMLU Accuracy [18.774650080306944]
We investigate the robustness of accuracy measurement on MMLU, a widely used multiple-choice question-answering dataset.
When the contents of the answer labels are shuffled, we find that every model we explore loses accuracy on MMLU, though not all models are equally sensitive; the shuffle itself is sketched below.
arXiv Detail & Related papers (2024-06-27T18:21:32Z)
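A minimal sketch of this shuffle probe, reusing the hypothetical field names and four-letter labels from the earlier sketch (this illustrates the idea, it is not the paper's code):

```python
import random

def shuffle_choices(example, rng=random):
    """Permute the answer texts among the labels A-D and re-map the gold
    letter, so only the position of the correct content changes while the
    question itself stays fixed."""
    letters = ["A", "B", "C", "D"]
    options = list(example["options"])
    gold_text = options[letters.index(example["answer"])]
    rng.shuffle(options)
    return {
        **example,
        "options": options,
        "answer": letters[options.index(gold_text)],
    }
```

Scoring a model on the original and shuffled copies of the dataset then isolates how much of its accuracy depends on where the correct answer sits rather than on what it says.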
- Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps.
We show that ReasonEval achieves state-of-the-art performance on human-labeled datasets.
We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
- Measuring Massive Multitask Chinese Understanding [16.41629318344805]
This test encompasses four major domains: medicine, law, psychology, and education.
The best-performing models in the zero-shot setting outperformed the worst-performing models by nearly 18.6 percentage points on average.
All models performed poorly in the legal domain, with the highest zero-shot accuracy reaching only 0.239.
arXiv Detail & Related papers (2023-04-25T16:51:53Z)
- Plex: Towards Reliability using Pretrained Large Model Extensions [69.13326436826227]
We develop ViT-Plex and T5-Plex, pretrained large model extensions for vision and language modalities, respectively.
Plex greatly improves the state-of-the-art across reliability tasks, and simplifies the traditional protocol.
We demonstrate scaling effects over model sizes up to 1B parameters and pretraining dataset sizes up to 4B examples.
arXiv Detail & Related papers (2022-07-15T11:39:37Z)
- Language Models (Mostly) Know What They Know [10.836210010868932]
We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly.
We investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer; a minimal sketch of such a probe appears below.
arXiv Detail & Related papers (2022-07-11T22:59:39Z)
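One way to picture the P(IK) setup is a small probe over the language model's hidden state. This is a sketch of the idea only (the paper fine-tunes the model itself with a value head), and the names `PIKHead`, `hidden`, and `answered_correctly` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PIKHead(nn.Module):
    """Probe predicting P(IK): the probability that "I know" the answer.

    `hidden` stands in for the model's final hidden state at the end of a
    question, mapped by a single linear layer to a probability.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(hidden)).squeeze(-1)

def pik_loss(head, hidden, answered_correctly):
    # Targets are 1 where the model later answered the question correctly
    # and 0 where it did not, so cross-entropy pushes the probe's output
    # toward the model's actual success rate.
    return nn.functional.binary_cross_entropy(
        head(hidden), answered_correctly.float()
    )
```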
- Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning reach comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits; a simplified sketch of this style of adaptation follows below.
arXiv Detail & Related papers (2021-09-07T03:13:06Z)
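For intuition about the adaptation style named above, here is a minimal input-level soft-prompt module. It is a simplification under assumptions: true prefix-tuning inserts learned prefixes into every attention layer's keys and values rather than only the input embeddings, and all names here are illustrative.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Simplified prefix/prompt-style adaptation: a few trainable virtual
    token embeddings are prepended to every input while the base model's
    weights stay frozen."""

    def __init__(self, embed_dim: int, prefix_len: int = 10):
        super().__init__()
        # These embeddings are the only parameters that get trained.
        self.prefix = nn.Parameter(torch.randn(prefix_len, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim)
        batch = token_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, token_embeds], dim=1)
```

Keeping the base model frozen and training only the prefix is what distinguishes this style of adaptation from the full fine-tuning the abstract contrasts it with.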
"Model Zoo" is an algorithm that builds an ensemble of models, each of which is very small, and it is trained on a smaller set of tasks.
Model Zoo achieves large gains in prediction accuracy compared to state-of-the-art methods in multi-task and continual learning.
arXiv Detail & Related papers (2021-06-06T04:25:09Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN); a rough sketch of the estimation step follows below.
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
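As a rough illustration of the estimation idea only, not the paper's algorithm (which additionally chooses which points to label), a BNN's posterior over labels can stand in for the missing ground truth when estimating a metric such as accuracy; all names here are hypothetical.

```python
import numpy as np

def estimate_accuracy(model_preds, bnn_label_probs):
    """Label-free accuracy estimate for a model-under-test.

    model_preds: shape (n,), the tested model's predicted class ids.
    bnn_label_probs: shape (n, n_classes), a Bayesian neural network's
        posterior over the true label of each input.
    The expected accuracy is the posterior mass the BNN places on the
    tested model's predictions.
    """
    model_preds = np.asarray(model_preds)
    bnn_label_probs = np.asarray(bnn_label_probs)
    return float(
        bnn_label_probs[np.arange(len(model_preds)), model_preds].mean()
    )
```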
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models so that their confidence scores correlate better with the likelihood of correctness; a minimal calibration-error sketch follows below.
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
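A standard way to quantify this calibration question is expected calibration error (ECE), which compares stated confidence with realized accuracy inside confidence bins. A minimal sketch, assuming `confidences` holds the probabilities a QA model assigns to its chosen answers and `correct` marks which of those answers were right:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |confidence - accuracy| gap across confidence bins,
    weighted by the fraction of predictions falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi]; exact zeros fall outside, which is
        # harmless for probabilities of chosen answers.
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(
                correct[mask].mean() - confidences[mask].mean()
            )
    return ece
```

Post-hoc calibration methods such as temperature scaling aim to shrink exactly this gap without changing which answers the model picks.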
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.