Scaling Laws for Precision
- URL: http://arxiv.org/abs/2411.04330v1
- Date: Thu, 07 Nov 2024 00:10:10 GMT
- Title: Scaling Laws for Precision
- Authors: Tanishq Kumar, Zachary Ankner, Benjamin F. Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Ré, Aditi Raghunathan
- Abstract summary: We devise "precision-aware" scaling laws for both training and inference.
For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data.
For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions.
- Score: 73.24325358259753
- License:
- Abstract: Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise "precision-aware" scaling laws for both training and inference. We propose that training in lower precision reduces the model's "effective parameter count," allowing us to predict the additional loss incurred from training in low precision and post-train quantization. For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data, eventually making additional pretraining data actively harmful. For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions, and suggest that training larger models in lower precision may be compute optimal. We unify the scaling laws for post and pretraining quantization to arrive at a single functional form that predicts degradation from training and inference in varied precisions. We fit on over 465 pretraining runs and validate our predictions on model sizes up to 1.7B parameters trained on up to 26B tokens.
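As an illustration of the quantities the abstract describes, the sketch below evaluates a hypothetical precision-aware scaling law in which training precision shrinks the model's "effective parameter count" and post-training quantization adds a degradation term that grows with the amount of pretraining data. The Chinchilla-style functional form and every constant are placeholders chosen for readability, not the paper's fitted values.
```python
# Illustrative sketch of a "precision-aware" scaling law. It assumes a
# Chinchilla-style loss L(N, D) = E + A / N_eff^alpha + B / D^beta in which
# the training precision P enters through an "effective parameter count"
# N_eff = N * (1 - exp(-P / gamma)), and post-training quantization to
# P_post bits adds a degradation term that grows with the token budget D.
# Every constant below is a placeholder, not one of the paper's fitted values.
import math

A, B, E = 400.0, 1800.0, 1.7      # hypothetical Chinchilla-style constants
alpha, beta = 0.34, 0.28
gamma_train = 2.5                 # how quickly low training precision bites
C_ptq, gamma_post = 0.05, 1.0     # post-train quantization penalty scale/decay
gamma_D, gamma_N = 0.5, 0.5       # data amplifies, parameters dampen PTQ damage


def effective_params(n_params: float, train_bits: float) -> float:
    """Lower training precision shrinks the effective parameter count."""
    return n_params * (1.0 - math.exp(-train_bits / gamma_train))


def pretrain_loss(n_params: float, n_tokens: float, train_bits: float) -> float:
    """Chinchilla-style loss evaluated with precision-adjusted parameters."""
    n_eff = effective_params(n_params, train_bits)
    return E + A / n_eff ** alpha + B / n_tokens ** beta


def ptq_degradation(n_params: float, n_tokens: float, post_bits: float) -> float:
    """Quantizing after training hurts more the more data the model saw."""
    return C_ptq * (n_tokens ** gamma_D / n_params ** gamma_N) * math.exp(-post_bits / gamma_post)


if __name__ == "__main__":
    N, D = 1.7e9, 26e9            # largest scale validated in the paper
    for train_bits in (16, 8, 4):
        base = pretrain_loss(N, D, train_bits)
        total = base + ptq_degradation(N, D, post_bits=4)
        print(f"train@{train_bits}-bit: loss={base:.3f}, after 4-bit PTQ: {total:.3f}")
```
Run as written, it only shows the qualitative trends the paper quantifies: lowering training precision raises the predicted loss, and a larger token budget D inflates the post-training-quantization penalty.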
Related papers
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z) - A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
arXiv Detail & Related papers (2024-10-15T17:59:10Z) - Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations [62.132347451049455]
Scale has become a main ingredient in obtaining strong machine learning models.
In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule.
We show that weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales.
arXiv Detail & Related papers (2024-05-28T17:33:54Z) - Unraveling the Mystery of Scaling Laws: Part I [39.967120253159614]
Scaling law principles indicate a power-law correlation between loss and variables such as model size, dataset size, and computational resources utilized during training.
The original scaling law paper by OpenAI did not disclose the complete details necessary to derive the precise scaling law formulas.
We provide step-by-step instructions to estimate all constant terms in scaling-law formulas by training models with only 1M to 60M parameters (a toy curve-fitting sketch of this workflow appears after this list).
arXiv Detail & Related papers (2024-03-11T10:05:29Z) - An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
arXiv Detail & Related papers (2023-10-19T17:57:16Z) - Scaling of Class-wise Training Losses for Post-hoc Calibration [6.0632746602205865]
We propose a new calibration method to synchronize the class-wise training losses.
We design a new training loss to alleviate the variance of class-wise training losses by using multiple class-wise scaling factors.
We validate the proposed framework by employing it in the various post-hoc calibration methods.
arXiv Detail & Related papers (2023-06-19T14:59:37Z) - Adaptive Low-Precision Training for Embeddings in Click-Through Rate Prediction [36.605153166169224]
Embedding tables are usually huge in click-through rate (CTR) prediction models.
We formulate a novel quantization training paradigm to compress the embeddings from the training stage, termed low-precision training.
For the first time in CTR models, we successfully train 8-bit embeddings without sacrificing prediction accuracy.
arXiv Detail & Related papers (2022-12-12T07:19:14Z) - Scaling Laws for Acoustic Models [7.906034575114518]
Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships between loss and scale.
We show that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws.
arXiv Detail & Related papers (2021-06-11T18:59:24Z) - Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z)
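Two of the related papers above ("A Hitchhiker's Guide to Scaling Law Estimation" and "Unraveling the Mystery of Scaling Laws: Part I") center on fitting the constant terms of a power-law loss from small, cheap training runs and extrapolating to larger ones. The sketch below, referenced from the "Unraveling" entry, illustrates that workflow on synthetic data under an assumed Chinchilla-style form; none of the numbers come from those papers.
```python
# Toy sketch of estimating scaling-law constants from small runs, in the
# spirit of the related papers above. Assumes a Chinchilla-style form
# L(N, D) = E + A * N**(-alpha) + B * D**(-beta); the "measurements" are
# synthetic, generated from made-up constants plus noise, not real runs.
import numpy as np
from scipy.optimize import curve_fit


def scaling_law(x, E, A, alpha, B, beta):
    n_params, n_tokens = x
    return E + A * n_params ** (-alpha) + B * n_tokens ** (-beta)


rng = np.random.default_rng(0)
# Hypothetical 1M-60M parameter runs with varying token budgets.
N = np.array([1e6, 1e6, 5e6, 5e6, 1e7, 3e7, 6e7, 6e7])
D = np.array([2e8, 8e8, 1e9, 4e9, 2e9, 6e9, 1.2e10, 2.4e10])
true_params = (1.7, 400.0, 0.34, 1800.0, 0.28)  # E, A, alpha, B, beta (made up)
loss = scaling_law((N, D), *true_params) + rng.normal(0.0, 0.01, size=N.shape)

# Fit the five constants to the small-scale measurements.
popt, _ = curve_fit(scaling_law, (N, D), loss,
                    p0=(2.0, 300.0, 0.3, 1500.0, 0.3), maxfev=20000)
print("fitted (E, A, alpha, B, beta):", np.round(popt, 3))

# Extrapolate to a much larger model and data budget.
print("extrapolated loss at N=1.7e9, D=26e9:", scaling_law((1.7e9, 26e9), *popt))
```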
This list is automatically generated from the titles and abstracts of the papers on this site.