Related papers: Predicting Emergent Capabilities by Finetuning

URL: http://arxiv.org/abs/2411.16035v1
Date: Mon, 25 Nov 2024 01:48:09 GMT
Title: Predicting Emergent Capabilities by Finetuning
Authors: Charlie Snell, Eric Wallace, Dan Klein, Sergey Levine
Abstract summary: We find that finetuning language models can shift the point in scaling at which emergence occurs towards less capable models. We validate this approach using four standard NLP benchmarks. We find that, in some cases, we can accurately predict whether models trained with up to 4x more compute have emerged.
Score: 98.9684114851891
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A fundamental open challenge in modern LLM scaling is the lack of understanding around emergent capabilities. In particular, language model pretraining loss is known to be highly predictable as a function of compute. However, downstream capabilities are far less predictable -- sometimes even exhibiting emergent jumps -- which makes it challenging to anticipate the capabilities of future models. In this work, we first pose the task of emergence prediction: given access to current LLMs that have random few-shot accuracy on a task, can we predict whether future models (GPT-N+1) will have non-trivial accuracy on that task? We then discover a simple insight for this problem: finetuning LLMs on a given task can shift the point in scaling at which emergence occurs towards less capable models. To operationalize this insight, we can finetune LLMs with varying amounts of data and fit a parametric function that predicts when emergence will occur (i.e., "emergence laws"). We validate this approach using four standard NLP benchmarks where large-scale open-source LLMs already demonstrate emergence (MMLU, GSM8K, CommonsenseQA, and CoLA). Using only small-scale LLMs, we find that, in some cases, we can accurately predict whether models trained with up to 4x more compute have emerged. Finally, we present a case study of two realistic uses for emergence prediction.
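Illustrative sketch (not from the paper): the abstract notes that pretraining loss is highly predictable as a function of compute. The snippet below shows one common way such a fit can be done, assuming a pure power law loss = a * C^(-b) fitted by least squares in log-log space. The function names, the compute values, and the constants 50.0 and 0.07 are made up for illustration. The paper's "emergence law" is a different parametric function whose form is not given in the abstract and is not reproduced here.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Fit loss ~= a * compute**(-b) by linear least squares in log-log space.

    Hypothetical helper for illustration; returns the pair (a, b).
    """
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)


def predict_loss(a, b, compute):
    """Evaluate the fitted power law at the given compute budget(s)."""
    return a * np.asarray(compute, dtype=float) ** (-b)


# Synthetic data generated from an assumed power law (not real measurements).
compute = np.array([1e18, 2e18, 4e18, 8e18])
loss = 50.0 * compute ** (-0.07)

a, b = fit_power_law(compute, loss)

# Extrapolate to a model trained with 4x the largest observed compute.
extrapolated = float(predict_loss(a, b, 4 * compute[-1]))
print(f"a={a:.3f}, b={b:.3f}, predicted loss at 4x compute={extrapolated:.4f}")
```

A smooth fit like this is what works for loss. Per the abstract, downstream accuracy can instead jump abruptly, which is why a separate emergence-prediction procedure is needed.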

Related papers

From Text to Time? Rethinking the Effectiveness of the Large Language Model for Time Series Forecasting [22.052783052469344]
Using pre-trained large language models (LLMs) as the backbone for time series prediction has recently gained significant research interest. We observe that training and testing LLM-based models on small datasets often leads to the Encoder and Decoder becoming overly adapted to the dataset. Extensive experiments reveal that although the LLM backbone demonstrates some promise, its forecasting performance is limited.
arXiv Detail & Related papers (2025-04-09T13:20:09Z)
Establishing Task Scaling Laws via Compute-Efficient Model Ladders [123.8193940110293]
We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. We leverage a two-step prediction approach: first use model and data size to predict a task-specific loss, and then use this task loss to predict task performance.
arXiv Detail & Related papers (2024-12-05T18:21:49Z)
Bayesian scaling laws for in-context learning [72.17734205418502]
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates. We show that ICL approximates a Bayesian learner and develop a family of novel Bayesian scaling laws for ICL.
arXiv Detail & Related papers (2024-10-21T21:45:22Z)
Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more-efficient metric for performance estimation. We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources. We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific loss and downstream performance.
arXiv Detail & Related papers (2024-10-11T04:57:48Z)
LLMs are Not Just Next Token Predictors [0.0]
LLMs are statistical models of language learning through gradient descent with a next token prediction objective. While LLMs are engineered using next token prediction, and trained based on their success at this task, our view is that a reduction to just next token predictor sells LLMs short. In order to draw this out, we will make an analogy with a once prominent research program in biology explaining evolution and development from the gene's eye view.
arXiv Detail & Related papers (2024-08-06T16:36:28Z)
MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning [11.174544614042984]
During fine-tuning, large language models (LLMs) may forget the knowledge acquired in the pre-training stage, leading to a decline in general capabilities. We propose a new fine-tuning algorithm termed Momentum-Filtered Optimizer (MoFO). MoFO achieves similar fine-tuning performance while keeping parameters closer to the pre-trained model.
arXiv Detail & Related papers (2024-07-30T17:38:24Z)
Can Language Models Use Forecasting Strategies? [14.332379032371612]
We describe experiments using a novel dataset of real world events and associated human predictions. We find that models still struggle to make accurate predictions about the future.
arXiv Detail & Related papers (2024-06-06T19:01:42Z)
Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models. This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution. We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z)
Temporal Scaling Law for Large Language Models [57.83580734589091]
We propose the novel concept of Temporal Scaling Law, studying how the test loss of an LLM evolves as the training steps scale up. In contrast to modeling the test loss as a whole in a coarse-grained manner, we break it down and dive into the fine-grained test loss of each token position. We derive the much more precise temporal scaling law by studying the temporal patterns of the parameters in the dynamic hyperbolic-law.
arXiv Detail & Related papers (2024-04-27T05:49:11Z)
Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve [21.55766758950951]
We make predictions about the strategies that large language models will adopt to solve next-word prediction tasks. We evaluate two LLMs on eleven tasks and find robust evidence that LLMs are influenced by probability. We conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system.
arXiv Detail & Related papers (2023-09-24T13:35:28Z)
Making Pre-trained Language Models both Task-solvers and Self-calibrators [52.98858650625623]
Pre-trained language models (PLMs) serve as backbones for various real-world systems. Previous work shows that introducing an extra calibration task can mitigate this issue. We propose a training algorithm LM-TOAST to tackle the challenges.
arXiv Detail & Related papers (2023-07-21T02:51:41Z)
nanoLM: an Affordable LLM Pre-training Benchmark via Accurate Loss Prediction across Scales [65.01417261415833]
We present an approach to predict the pre-training loss based on our observations that Maximal Update Parametrization (muP) enables accurate fitting of scaling laws. With around 14% of the one-time pre-training cost, we can accurately forecast the loss for models up to 52B. Our goal with nanoLM is to empower researchers with limited resources to reach meaningful conclusions on large models.
arXiv Detail & Related papers (2023-04-14T00:45:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.