Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
- URL: http://arxiv.org/abs/2210.09261v1
- Date: Mon, 17 Oct 2022 17:08:26 GMT
- Title: Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
- Authors: Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann,
Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny
Zhou, Jason Wei
- Abstract summary: We focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH)
We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex to surpass the average human-rater performance on 17 of the 23 tasks.
- Score: 108.54545521369688
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that
focuses on tasks believed to be beyond the capabilities of current language
models. Language models have already made good progress on this benchmark, with
the best model in the BIG-Bench paper outperforming average reported
human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But
on what tasks do language models fall short of average human-rater performance,
and are those tasks actually unsolvable by current language models?
In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we
call BIG-Bench Hard (BBH). These are the tasks for which prior language model
evaluations did not outperform the average human-rater. We find that applying
chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the
average human-rater performance on 10 of the 23 tasks, and Codex
(code-davinci-002) to surpass the average human-rater performance on 17 of the
23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot
prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al.,
2022), substantially underestimates the best performance and capabilities of
language models, which is better captured via CoT prompting. As further
analysis, we explore the interaction between CoT and model scale on BBH,
finding that CoT enables emergent task performance on several BBH tasks with
otherwise flat scaling curves.
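To make the prompting setup concrete, the sketch below contrasts answer-only few-shot prompting (as used in the original BIG-Bench evaluations) with chain-of-thought prompting on a BBH-style date-understanding question. The exemplar text, the question, and the `query_model` stub are illustrative assumptions, not the prompts used in the paper.

```python
# Minimal sketch (illustrative, not the paper's released prompts) contrasting
# answer-only few-shot prompting with chain-of-thought (CoT) prompting on a
# BBH-style date-understanding question. `query_model` is a hypothetical
# placeholder for any text-completion API (e.g., PaLM or code-davinci-002).

ANSWER_ONLY_EXEMPLAR = (
    "Q: Today is Christmas Eve of 1937. What is the date tomorrow in MM/DD/YYYY?\n"
    "A: 12/25/1937\n"
)

COT_EXEMPLAR = (
    "Q: Today is Christmas Eve of 1937. What is the date tomorrow in MM/DD/YYYY?\n"
    "A: Let's think step by step. Christmas Eve of 1937 is 12/24/1937, so the date\n"
    "tomorrow is 12/25/1937. The answer is 12/25/1937.\n"
)


def build_prompt(exemplar: str, question: str, use_cot: bool) -> str:
    """Prepend a few-shot exemplar; for CoT, cue the model to reason before answering."""
    answer_cue = "A: Let's think step by step." if use_cot else "A:"
    return f"{exemplar}\nQ: {question}\n{answer_cue}"


def query_model(prompt: str) -> str:
    """Hypothetical stand-in: replace with a call to an actual completion API."""
    raise NotImplementedError


question = (
    "Today is the last day of February 2020. "
    "What is the date tomorrow in MM/DD/YYYY?"
)
print(build_prompt(ANSWER_ONLY_EXEMPLAR, question, use_cot=False))
print(build_prompt(COT_EXEMPLAR, question, use_cot=True))
```

With CoT, the model's final answer has to be parsed out of the generated rationale before it can be compared against the reference answer, whereas the answer-only prompt elicits the answer directly.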
Related papers
- Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models [65.48185395952788]
Buffer of Thoughts (BoT) is a novel and versatile thought-augmented reasoning approach.
We propose meta-buffer to store a series of informative high-level thoughts.
For each problem, we retrieve a relevant thought-template and adaptively instantiate it with specific reasoning structures.
arXiv Detail & Related papers (2024-06-06T17:22:08Z)
- Sign of the Times: Evaluating the use of Large Language Models for Idiomaticity Detection [2.2724928083094196]
This work looks at the performance of a range of LLMs on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE.
We find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales.
arXiv Detail & Related papers (2024-05-15T11:55:14Z)
- How predictable is language model benchmark performance? [0.07143413923310668]
We show that average benchmark performance, aggregating over many individual tasks, is decently predictable as a function of training compute scale.
Individual task performance remains significantly more predictable than chance.
arXiv Detail & Related papers (2024-01-09T17:34:30Z)
- How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench [52.11481619456093]
We study the performance prediction problem on experiment records from BIG-bench.
An $R^2$ score greater than 95% indicates the presence of learnable patterns within the experiment records.
We find a subset as informative as BIG-bench Hard for evaluating new model families, while being $3\times$ smaller.
arXiv Detail & Related papers (2023-05-24T09:35:34Z)
- Task Ambiguity in Humans and Language Models [7.033374427612259]
We propose AmbiBench, a new benchmark of ambiguously-specified classification tasks.
We evaluate humans and models on AmbiBench by seeing how well they identify the intended task.
We show how to dramatically improve the accuracy of language models trained without large-scale human feedback training.
arXiv Detail & Related papers (2022-12-20T18:35:33Z)
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models [648.3665819567409]
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale.
Big-bench consists of 204 tasks, contributed by 450 authors across 132 institutions.
We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench.
arXiv Detail & Related papers (2022-06-09T17:05:34Z)
- SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks.
We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain.
We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z)
- Multitask Prompted Training Enables Zero-Shot Task Generalization [70.12770442071657]
We develop a system for mapping general natural language tasks into a human-readable prompted form.
We fine-tune a pretrained encoder-decoder model on this multitask mixture covering a wide variety of tasks.
The model attains strong zero-shot performance on several standard datasets, often outperforming models 16x its size.
arXiv Detail & Related papers (2021-10-15T17:08:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.