(Mis)Fitting: A Survey of Scaling Laws
- URL: http://arxiv.org/abs/2502.18969v1
- Date: Wed, 26 Feb 2025 09:27:54 GMT
- Title: (Mis)Fitting: A Survey of Scaling Laws
- Authors: Margaret Li, Sneha Kudugunta, Luke Zettlemoyer
- Abstract summary: We discuss discrepancies in the conclusions that several prior works reach on questions such as the optimal token-to-parameter ratio. We survey over 50 papers that study scaling trends. We propose a checklist for authors to consider while contributing to scaling law research.
- Score: 52.598843243928584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern foundation models rely heavily on scaling laws to guide crucial training decisions. Researchers often extrapolate the optimal architecture and hyperparameter settings from smaller training runs by describing the relationship between loss, or task performance, and scale. All components of this process vary, from the specific equation being fit, to the training setup, to the optimization method. Each of these factors may affect the fitted law and, therefore, the conclusions of a given study. We discuss discrepancies in the conclusions that several prior works reach on questions such as the optimal token-to-parameter ratio. We augment this discussion with our own analysis of the critical impact that changes in specific details can have on a scaling study, and the resulting altered conclusions. Additionally, we survey over 50 papers that study scaling trends: while 45 of these papers quantify these trends using a power law, most under-report crucial details needed to reproduce their findings. To mitigate this, we propose a checklist for authors to consider while contributing to scaling law research.
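To make the abstract's point concrete, here is a minimal, self-contained sketch (our illustration, not code or data from the paper): it generates synthetic loss-versus-model-size points from a saturating power law and fits them twice, once with and once without an irreducible-loss term. The two specifications generally yield different exponents, which is exactly the kind of fitting detail the survey argues should be reported.

```python
# A minimal sketch (not the survey's code): fit a saturating power law
#   L(N) = E + A * N**(-alpha)
# to synthetic loss-vs-model-size points, then refit the same points
# without the irreducible term E. The functional form alone shifts the
# estimated exponent.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

N = np.logspace(7, 10, 12)                           # 10M to 10B parameters
true_E, true_A, true_alpha = 1.7, 4.0e2, 0.35
loss = true_E + true_A * N ** (-true_alpha)
loss *= np.exp(rng.normal(0.0, 0.01, size=N.shape))  # multiplicative noise

def with_offset(n, E, A, alpha):
    return E + A * n ** (-alpha)

def without_offset(n, A, alpha):
    return A * n ** (-alpha)

popt_full, _ = curve_fit(with_offset, N, loss, p0=[1.0, 100.0, 0.3], maxfev=20000)
popt_bare, _ = curve_fit(without_offset, N, loss, p0=[100.0, 0.3], maxfev=20000)

print(f"with irreducible loss E:    alpha = {popt_full[2]:.3f}")
print(f"without irreducible loss E: alpha = {popt_bare[1]:.3f}")
```

Without the offset term, the fit absorbs the irreducible loss into a much flatter exponent, so two studies fitting the same runs with different functional forms can reach different conclusions about scaling behavior.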
Related papers
- A Survey of Scaling in Large Language Model Reasoning [62.92861523305361]
We provide a comprehensive examination of scaling in large language model (LLM) reasoning.
We analyze how scaling the number of reasoning steps improves multi-step inference and logical consistency.
We discuss scaling in training-enabled reasoning, focusing on optimization through iterative model improvement.
arXiv Detail & Related papers (2025-04-02T23:51:27Z) - Scaling Law for Language Models Training Considering Batch Size [17.09348741898811]
Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress.
We empirically investigate how a critical hyperparameter, i.e., the global batch size, influences the LLM training process.
We establish a basic scaling law on model size and training data amount.
We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models.
arXiv Detail & Related papers (2024-12-02T13:58:35Z) - A Simple Model of Inference Scaling Laws [1.3597551064547502]
We study scaling laws in the context of inference, specifically how performance improves with multiple inference attempts.
Our simple framework lays the groundwork for combining inference scaling with other known scaling laws (a toy illustration appears after this list).
arXiv Detail & Related papers (2024-10-21T18:00:06Z) - Revisiting the Superficial Alignment Hypothesis [0.9831489366502302]
The Superficial Alignment Hypothesis posits that almost all of a language model's abilities and knowledge are learned during pre-training.
We re-examine these claims by studying the scaling behavior of post-training with increasing finetuning examples.
arXiv Detail & Related papers (2024-09-27T22:14:10Z) - See Further for Parameter Efficient Fine-tuning by Standing on the Shoulders of Decomposition [56.87609859444084]
Parameter-efficient fine-tuning (PEFT) focuses on optimizing a select subset of parameters while keeping the rest fixed, significantly lowering computational and storage overheads.
We take the first step to unify all approaches by dissecting them from a decomposition perspective.
We introduce two novel PEFT methods alongside a simple yet effective framework designed to enhance the performance of PEFT techniques across various applications.
arXiv Detail & Related papers (2024-07-07T15:44:42Z) - The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis [27.310894780313618]
This paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints.
We confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes.
In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints.
arXiv Detail & Related papers (2024-04-01T16:00:01Z) - Scaling Laws For Dense Retrieval [22.76001461620846]
We investigate whether the performance of dense retrieval models follows scaling laws like those of other neural models.
Results indicate that, under our settings, the performance of dense retrieval models follows a precise power-law scaling related to the model size and the number of annotations.
arXiv Detail & Related papers (2024-03-27T15:27:36Z) - Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets.
We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
arXiv Detail & Related papers (2023-09-15T16:29:27Z) - Training Data Influence Analysis and Estimation: A Survey [25.460140245596918]
We provide the first comprehensive survey of training data influence analysis and estimation.
We organize state-of-the-art influence analysis methods into a taxonomy.
We propose future research directions to make influence analysis more useful in practice.
arXiv Detail & Related papers (2022-12-09T00:32:46Z) - Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers [57.931830650323]
This paper presents scaling insights from pretraining and finetuning Transformers.
We show that, beyond model size alone, model shape matters for downstream fine-tuning.
We present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality.
arXiv Detail & Related papers (2021-09-22T12:29:15Z) - Everything Has a Cause: Leveraging Causal Inference in Legal Text Analysis [62.44432226563088]
Causal inference is the process of capturing cause-effect relationships among variables.
We propose a novel Graph-based Causal Inference framework, which builds causal graphs from fact descriptions without much human involvement.
We observe that the causal knowledge contained in GCI can be effectively injected into powerful neural networks for better performance and interpretability.
arXiv Detail & Related papers (2021-04-19T16:13:10Z)
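For the inference-scaling entry above, a toy calculation makes the idea concrete. This is our own illustration under an independence assumption, not the cited paper's actual model: if each inference attempt succeeds independently with probability p, then pass@k = 1 - (1 - p)^k, so the failure rate decays exponentially in the number of attempts k.

```python
# A toy baseline for inference-time scaling (our illustration, not the
# cited paper's model): with independent attempts, each succeeding with
# probability p, coverage after k attempts is pass@k = 1 - (1 - p)**k.
import numpy as np

p = 0.02                                 # per-attempt success probability
k = np.array([1, 4, 16, 64, 256, 1024])  # number of inference attempts
pass_at_k = 1.0 - (1.0 - p) ** k

for ki, ci in zip(k, pass_at_k):
    print(f"k={ki:5d}  pass@k={ci:.3f}")
```

Real attempts are rarely independent, so measured curves typically improve more slowly than this baseline; modeling that gap is where dedicated inference scaling laws come in.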