How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench
- URL: http://arxiv.org/abs/2305.14947v2
- Date: Tue, 31 Oct 2023 17:27:07 GMT
- Title: How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench
- Authors: Qinyuan Ye, Harvey Yiyun Fu, Xiang Ren, Robin Jia
- Abstract summary: We study the performance prediction problem on experiment records from BIG-bench.
An $R^2$ score greater than 95% indicates the presence of learnable patterns within the experiment records.
We find a subset as informative as BIG-bench Hard for evaluating new model families, while being $3\times$ smaller.
- Score: 52.11481619456093
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We investigate the predictability of large language model (LLM) capabilities:
given records of past experiments using different model families, numbers of
parameters, tasks, and numbers of in-context examples, can we accurately
predict LLM performance on new experiment configurations? Answering this
question has practical implications for LLM users (e.g., deciding which models
to try), developers (e.g., prioritizing evaluation on representative tasks),
and the research community (e.g., identifying hard-to-predict capabilities that
warrant further investigation).
We study the performance prediction problem on experiment records from
BIG-bench. On a random train-test split, an MLP-based predictor achieves an
$R^2$ score greater than 95%, indicating the presence of learnable patterns
within the experiment records. We then formulate the problem of searching for
"small-bench," an informative subset of BIG-bench tasks from which the
performance on the full set can be maximally recovered. We find a subset as
informative as BIG-bench Hard for evaluating new model families, while being
$3\times$ smaller. Additionally, we find competitive subsets by clustering task
representations learned by our MLP-based predictor and selecting tasks close to
cluster centroids, highlighting the importance of task diversity in
constructing "small-bench."
Related papers
- Great Memory, Shallow Reasoning: Limits of $k$NN-LMs [71.73611113995143]
$k$NN-LMs, which integrate retrieval with next-word prediction, have demonstrated strong performance in language modeling.
We ask whether this improved ability to recall information really translates into downstream abilities.
arXiv Detail & Related papers (2024-08-21T17:59:05Z)
- LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science.
Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
arXiv Detail & Related papers (2024-07-02T22:23:40Z)
- GistScore: Learning Better Representations for In-Context Example Selection with Gist Bottlenecks [3.9638110494107095]
In-context Learning (ICL) is the ability of Large Language Models (LLMs) to perform new tasks when conditioned on prompts.
We propose Example Gisting, a novel approach for training example encoders through supervised fine-tuning.
We show that our fine-tuned models achieve state-of-the-art ICL performance, with over 20% absolute gain over off-the-shelf retrievers.
arXiv Detail & Related papers (2023-11-16T06:28:05Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose Auto-J, a generative judge with 13B parameters, designed to address the challenges of evaluating alignment.
Our model is trained on user queries and LLM-generated responses across a wide range of real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs).
We formulate a novel task, Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables.
We show that these models perform close to random chance on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z)
- Numeracy from Literacy: Data Science as an Emergent Skill from Large Language Models [0.0]
Large language models (LLMs) such as OpenAI's ChatGPT and GPT-3 offer unique testbeds for exploring the challenge of turning literacy into numeracy.
Publicly available transformer models from eighteen months earlier, 1000 times smaller, failed at basic arithmetic.
This work examines whether next-token prediction extends beyond sentence completion into genuine numerical understanding.
arXiv Detail & Related papers (2023-01-31T03:14:57Z)
- The Devil is in Classification: A Simple Framework for Long-tail Object Detection and Instance Segmentation [93.17367076148348]
We investigate the performance drop of the state-of-the-art two-stage instance segmentation model Mask R-CNN on the recent long-tail LVIS dataset.
We unveil that a major cause is the inaccurate classification of object proposals.
We propose a simple calibration framework to more effectively alleviate classification-head bias with a bi-level class-balanced sampling approach.
arXiv Detail & Related papers (2020-07-23T12:49:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.