The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning
- URL: http://arxiv.org/abs/2310.04680v1
- Date: Sat, 7 Oct 2023 03:36:39 GMT
- Title: The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning
- Authors: Tian Jin, Nolan Clement, Xin Dong, Vaishnavh Nagarajan, Michael Carbin, Jonathan Ragan-Kelley, Gintare Karolina Dziugaite
- Abstract summary: We study two natural scaling techniques: weight pruning and dense scaling (simply training a smaller or larger model).
We find a striking difference in how two core abilities, fact recall and in-context learning, evolve under scaling.
The fact that both dense scaling and weight pruning exhibit this behavior suggests that scaling model size has an inherently disparate effect on fact recall and in-context learning.
- Score: 34.76303922401322
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How does scaling the number of parameters in large language models (LLMs)
affect their core capabilities? We study two natural scaling techniques --
weight pruning and simply training a smaller or larger model, which we refer to
as dense scaling -- and their effects on two core capabilities of LLMs: (a)
recalling facts presented during pre-training and (b) processing information
presented in-context during inference. By curating a suite of tasks that help
disentangle these two capabilities, we find a striking difference in how these
two abilities evolve due to scaling. Reducing the model size by more than 30%
(via either scaling approach) significantly decreases the ability to recall
facts seen in pre-training. Yet, a 60-70% reduction largely preserves the
various ways the model can process in-context information, ranging from
retrieving answers from a long context to learning parameterized functions from
in-context exemplars. The fact that both dense scaling and weight pruning
exhibit this behavior suggests that scaling model size has an inherently
disparate effect on fact recall and in-context learning.
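To make the contrast concrete, here is a minimal sketch (not the paper's evaluation protocol): it magnitude-prunes a small open causal LM and compares the log-probability of a fact recalled closed-book with the same fact supplied in context. The model name, sparsity level, and prompts are placeholders chosen for illustration.

```python
# Illustrative sketch: prune a small causal LM, then compare (a) closed-book fact
# recall, where the fact must come from the weights, with (b) in-context processing,
# where the same fact is supplied in the prompt. Model, sparsity, and prompts are
# assumptions made for this example.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-70m"  # assumed stand-in for the models studied
SPARSITY = 0.5                        # assumed fraction of weights zeroed per linear layer

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

# Unstructured magnitude pruning of every linear layer's weight matrix.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=SPARSITY)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor

def answer_logprob(prompt: str, answer: str) -> float:
    """Log-probability the model assigns to `answer` as a continuation of `prompt`."""
    ids = tok(prompt + answer, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    targets = ids[0, 1:]
    return logprobs[prompt_len - 1:].gather(1, targets[prompt_len - 1:, None]).sum().item()

# (a) Fact recall: the answer must be stored in the parameters.
closed_book = answer_logprob("The capital of France is", " Paris")
# (b) In-context processing: the same fact is retrievable from the prompt.
open_book = answer_logprob(
    "Context: The capital of France is Paris.\nQuestion: The capital of France is",
    " Paris",
)
print(f"closed-book logprob: {closed_book:.2f}, open-book logprob: {open_book:.2f}")
```

The paper's finding predicts that, as sparsity grows past roughly 30%, the closed-book score is the one that degrades first, while the open-book score holds up to much higher sparsity.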
Related papers
- Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics.
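To illustrate the difference-in-differences idea on synthetic numbers (this is not the paper's estimator or data): the memorisation effect is the loss change for instances included in a training window minus the loss change for held-out instances over the same window.

```python
# Illustrative difference-in-differences (DiD) estimate on synthetic numbers.
# The loss arrays are hypothetical per-instance losses measured before and after
# a training window; "treated" instances were trained on, "control" were not.
import numpy as np

rng = np.random.default_rng(0)
loss_treated_pre = rng.normal(3.0, 0.2, size=500)
loss_treated_post = rng.normal(2.2, 0.2, size=500)   # trained-on: large drop
loss_control_pre = rng.normal(3.0, 0.2, size=500)
loss_control_post = rng.normal(2.8, 0.2, size=500)   # held-out: generic improvement only

# DiD: (change for treated) minus (change for control) isolates the causal effect
# of training on those specific instances, i.e. a memorisation estimate.
did = (loss_treated_post.mean() - loss_treated_pre.mean()) - (
    loss_control_post.mean() - loss_control_pre.mean()
)
print(f"estimated memorisation effect on loss: {did:.3f}")
```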
arXiv Detail & Related papers (2024-06-06T17:59:09Z)
- Emergent Abilities in Reduced-Scale Generative Language Models [10.51168925267033]
Large language models can solve new tasks without task-specific fine-tuning.
This ability is considered emergent and is primarily seen in large language models with billions of parameters.
This study investigates if such emergent properties are strictly tied to model size or can be demonstrated by smaller models trained on reduced-scale data.
arXiv Detail & Related papers (2024-04-02T18:00:28Z)
- Scaling Laws For Dense Retrieval [22.76001461620846]
We investigate whether the performance of dense retrieval models follows the same scaling laws as other neural models.
Results indicate that, under our settings, the performance of dense retrieval models follows a precise power-law scaling related to the model size and the number of annotations.
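A minimal sketch of what power-law scaling means in practice, assuming the common form metric ≈ a · N^b and fitting it by linear regression in log-log space; the (model size, metric) pairs below are synthetic placeholders, not the paper's measurements.

```python
# Fit performance = a * N**b by least squares in log-log space.
# The (model_size, metric) pairs are synthetic placeholders.
import numpy as np

model_sizes = np.array([30e6, 110e6, 330e6, 1.1e9])  # parameter counts (hypothetical)
metric = np.array([0.21, 0.27, 0.34, 0.43])          # e.g. a retrieval quality score

b, log_a = np.polyfit(np.log(model_sizes), np.log(metric), deg=1)
a = np.exp(log_a)
print(f"fitted power law: metric ~ {a:.3g} * N^{b:.3f}")
```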
arXiv Detail & Related papers (2024-03-27T15:27:36Z)
- An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
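One way to read "ensembling a large pre-trained model with a small fine-tuned model" is as logit arithmetic over next-token distributions: add the small model's fine-tuning shift (fine-tuned minus base logits) to the large base model's logits. The sketch below shows that combination on random placeholder logits; it illustrates the idea and is not claimed to be the paper's exact formulation.

```python
# Illustrative logit arithmetic for "up-scaling": combine a large base model's
# next-token logits with the behavioral delta of a small fine-tuned model.
# The logits below are random placeholders standing in for real model outputs.
import torch

vocab = 32_000
logits_base_large = torch.randn(vocab)   # large pre-trained model
logits_base_small = torch.randn(vocab)   # small pre-trained model
logits_ft_small = torch.randn(vocab)     # small fine-tuned model

# Add the small model's fine-tuning shift to the large base model's logits,
# then sample the next token from the combined distribution.
combined = logits_base_large + (logits_ft_small - logits_base_small)
next_token = torch.multinomial(torch.softmax(combined, dim=-1), num_samples=1)
print(int(next_token))
```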
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
- Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models [92.11542797811461]
We introduce NeQA, a dataset consisting of questions with negation.
We show that this task can exhibit inverse scaling, U-shaped scaling, or positive scaling, and decompose it into two subtasks: question answering (task 1) and negation understanding (task 2).
We find that task 1 has linear scaling, while task 2 has sigmoid-shaped scaling with an emergent transition point.
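As a toy illustration of how those two shapes can compose, assume a two-choice task where the model answers correctly when it both handles negation and knows the underlying answer, or when it fails at both (ignoring negation flips a known answer). The curves below are synthetic and encode only that simplifying assumption, not the paper's analysis.

```python
# Toy composition: a linearly improving QA subtask and a sigmoid-shaped negation
# subtask. With the flipping assumption above, overall accuracy first falls
# (inverse scaling) and then rises after the transition (U-shape).
import numpy as np

log_scale = np.linspace(0, 6, 13)                 # hypothetical log model size
p_qa = 0.5 + 0.08 * log_scale                     # task 1: linear improvement
p_neg = 1 / (1 + np.exp(-(log_scale - 4.0) * 3))  # task 2: sigmoid with a transition

# Correct if negation handled AND answer known, or negation ignored AND answer unknown.
acc = p_neg * p_qa + (1 - p_neg) * (1 - p_qa)
for s, a in zip(log_scale, acc):
    print(f"log-size {s:4.1f}  accuracy {a:.2f}")
```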
arXiv Detail & Related papers (2023-05-27T00:07:17Z)
- Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale [5.759319006531332]
We show the benefits of pre-training with a masked language modeling (MLM) objective in models as small as 1.25M parameters.
We examine downscaling effects, extending scaling laws to models as small as 1M parameters.
arXiv Detail & Related papers (2023-05-26T21:22:10Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- Do Language Embeddings Capture Scales? [54.1633257459927]
We show that pretrained language models capture a significant amount of information about the scalar magnitudes of objects.
We identify contextual information in pre-training and numeracy as two key factors affecting their performance.
arXiv Detail & Related papers (2020-10-11T21:11:09Z)