The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning
- URL: http://arxiv.org/abs/2310.04680v1
- Date: Sat, 7 Oct 2023 03:36:39 GMT
- Title: The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning
- Authors: Tian Jin, Nolan Clement, Xin Dong, Vaishnavh Nagarajan, Michael Carbin, Jonathan Ragan-Kelley, Gintare Karolina Dziugaite
- Abstract summary: We study two natural scaling techniques: weight pruning and dense scaling (simply training a smaller or larger model).
We find a striking difference in how two core abilities, fact recall and in-context learning, evolve under scaling.
The fact that both dense scaling and weight pruning exhibit this behavior suggests that scaling model size has an inherently disparate effect on fact recall and in-context learning.
- Score: 34.76303922401322
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How does scaling the number of parameters in large language models (LLMs)
affect their core capabilities? We study two natural scaling techniques --
weight pruning and simply training a smaller or larger model, which we refer to
as dense scaling -- and their effects on two core capabilities of LLMs: (a)
recalling facts presented during pre-training and (b) processing information
presented in-context during inference. By curating a suite of tasks that help
disentangle these two capabilities, we find a striking difference in how these
two abilities evolve due to scaling. Reducing the model size by more than 30%
(via either scaling approach) significantly decreases the ability to recall
facts seen in pre-training. Yet, a 60-70% reduction largely preserves the
various ways the model can process in-context information, ranging from
retrieving answers from a long context to learning parameterized functions from
in-context exemplars. The fact that both dense scaling and weight pruning
exhibit this behavior suggests that scaling model size has an inherently
disparate effect on fact recall and in-context learning.
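To make the contrast concrete, here is a minimal sketch (not the paper's evaluation protocol): it magnitude-prunes a small open causal LM and compares the log-probability of a fact recalled closed-book with the same fact supplied in context. The model name, sparsity level, and prompts are placeholders chosen for illustration.

```python
# Illustrative sketch: prune a small causal LM, then compare (a) closed-book fact
# recall, where the fact must come from the weights, with (b) in-context processing,
# where the same fact is supplied in the prompt. Model, sparsity, and prompts are
# assumptions made for this example.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-70m"  # assumed stand-in for the models studied
SPARSITY = 0.5                        # assumed fraction of weights zeroed per linear layer

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

# Unstructured magnitude pruning of every linear layer's weight matrix.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=SPARSITY)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor

def answer_logprob(prompt: str, answer: str) -> float:
    """Log-probability the model assigns to `answer` as a continuation of `prompt`."""
    ids = tok(prompt + answer, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    targets = ids[0, 1:]
    return logprobs[prompt_len - 1:].gather(1, targets[prompt_len - 1:, None]).sum().item()

# (a) Fact recall: the answer must be stored in the parameters.
closed_book = answer_logprob("The capital of France is", " Paris")
# (b) In-context processing: the same fact is retrievable from the prompt.
open_book = answer_logprob(
    "Context: The capital of France is Paris.\nQuestion: The capital of France is",
    " Paris",
)
print(f"closed-book logprob: {closed_book:.2f}, open-book logprob: {open_book:.2f}")
```

The paper's finding predicts that, as sparsity grows past roughly 30%, the closed-book score is the one that degrades first, while the open-book score holds up to much higher sparsity.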
Related papers
- Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics.
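To illustrate the difference-in-differences idea on synthetic numbers (this is not the paper's estimator or data): the memorisation effect is the loss change for instances included in a training window minus the loss change for held-out instances over the same window.

```python
# Illustrative difference-in-differences (DiD) estimate on synthetic numbers.
# The loss arrays are hypothetical per-instance losses measured before and after
# a training window; "treated" instances were trained on, "control" were not.
import numpy as np

rng = np.random.default_rng(0)
loss_treated_pre = rng.normal(3.0, 0.2, size=500)
loss_treated_post = rng.normal(2.2, 0.2, size=500)   # trained-on: large drop
loss_control_pre = rng.normal(3.0, 0.2, size=500)
loss_control_post = rng.normal(2.8, 0.2, size=500)   # held-out: generic improvement only

# DiD: (change for treated) minus (change for control) isolates the causal effect
# of training on those specific instances, i.e. a memorisation estimate.
did = (loss_treated_post.mean() - loss_treated_pre.mean()) - (
    loss_control_post.mean() - loss_control_pre.mean()
)
print(f"estimated memorisation effect on loss: {did:.3f}")
```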
arXiv Detail & Related papers (2024-06-06T17:59:09Z)
- Emergent Abilities in Reduced-Scale Generative Language Models [10.51168925267033]
Large language models can solve new tasks without task-specific fine-tuning.
This ability is considered emergent and is primarily seen in large language models with billions of parameters.
This study investigates if such emergent properties are strictly tied to model size or can be demonstrated by smaller models trained on reduced-scale data.
arXiv Detail & Related papers (2024-04-02T18:00:28Z)
- Scaling Laws For Dense Retrieval [22.76001461620846]
We investigate whether the performance of dense retrieval models follows the same scaling laws as other neural models.
Results indicate that, under our settings, the performance of dense retrieval models follows a precise power-law scaling related to the model size and the number of annotations.
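A minimal sketch of what power-law scaling means in practice, assuming the common form metric ≈ a · N^b and fitting it by linear regression in log-log space; the (model size, metric) pairs below are synthetic placeholders, not the paper's measurements.

```python
# Fit performance = a * N**b by least squares in log-log space.
# The (model_size, metric) pairs are synthetic placeholders.
import numpy as np

model_sizes = np.array([30e6, 110e6, 330e6, 1.1e9])  # parameter counts (hypothetical)
metric = np.array([0.21, 0.27, 0.34, 0.43])          # e.g. a retrieval quality score

b, log_a = np.polyfit(np.log(model_sizes), np.log(metric), deg=1)
a = np.exp(log_a)
print(f"fitted power law: metric ~ {a:.3g} * N^{b:.3f}")
```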
arXiv Detail & Related papers (2024-03-27T15:27:36Z)
- An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
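One way to read "ensembling a large pre-trained model with a small fine-tuned model" is as logit arithmetic over next-token distributions: add the small model's fine-tuning shift (fine-tuned minus base logits) to the large base model's logits. The sketch below shows that combination on random placeholder logits; it illustrates the idea and is not claimed to be the paper's exact formulation.

```python
# Illustrative logit arithmetic for "up-scaling": combine a large base model's
# next-token logits with the behavioral delta of a small fine-tuned model.
# The logits below are random placeholders standing in for real model outputs.
import torch

vocab = 32_000
logits_base_large = torch.randn(vocab)   # large pre-trained model
logits_base_small = torch.randn(vocab)   # small pre-trained model
logits_ft_small = torch.randn(vocab)     # small fine-tuned model

# Add the small model's fine-tuning shift to the large base model's logits,
# then sample the next token from the combined distribution.
combined = logits_base_large + (logits_ft_small - logits_base_small)
next_token = torch.multinomial(torch.softmax(combined, dim=-1), num_samples=1)
print(int(next_token))
```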
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
- Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models [92.11542797811461]
We introduce NeQA, a dataset consisting of questions with negation.
We show that this task can exhibit inverse scaling, U-shaped scaling, or positive scaling, and decompose it into two subtasks: question answering (task 1) and negation understanding (task 2).
We find that task 1 has linear scaling, while task 2 has sigmoid-shaped scaling with an emergent transition point.
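As a toy illustration of how those two shapes can compose, assume a two-choice task where the model answers correctly when it both handles negation and knows the underlying answer, or when it fails at both (ignoring negation flips a known answer). The curves below are synthetic and encode only that simplifying assumption, not the paper's analysis.

```python
# Toy composition: a linearly improving QA subtask and a sigmoid-shaped negation
# subtask. With the flipping assumption above, overall accuracy first falls
# (inverse scaling) and then rises after the transition (U-shape).
import numpy as np

log_scale = np.linspace(0, 6, 13)                 # hypothetical log model size
p_qa = 0.5 + 0.08 * log_scale                     # task 1: linear improvement
p_neg = 1 / (1 + np.exp(-(log_scale - 4.0) * 3))  # task 2: sigmoid with a transition

# Correct if negation handled AND answer known, or negation ignored AND answer unknown.
acc = p_neg * p_qa + (1 - p_neg) * (1 - p_qa)
for s, a in zip(log_scale, acc):
    print(f"log-size {s:4.1f}  accuracy {a:.2f}")
```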
arXiv Detail & Related papers (2023-05-27T00:07:17Z)
- Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale [5.759319006531332]
We show the benefits of pre-training with a masked language modeling (MLM) objective in models as small as 1.25M parameters.
We examine downscaling effects, extending scaling laws to models as small as 1M parameters.
arXiv Detail & Related papers (2023-05-26T21:22:10Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- Do Language Embeddings Capture Scales? [54.1633257459927]
We show that pretrained language models capture a significant amount of information about the scalar magnitudes of objects.
We identify contextual information in pre-training and numeracy as two key factors affecting their performance.
arXiv Detail & Related papers (2020-10-11T21:11:09Z)