Is It a Free Lunch for Removing Outliers during Pretraining?
- URL: http://arxiv.org/abs/2402.12102v1
- Date: Mon, 19 Feb 2024 12:45:52 GMT
- Title: Is It a Free Lunch for Removing Outliers during Pretraining?
- Authors: Baohao Liao, Christof Monz
- Abstract summary: We introduce a novel softmax function aimed at pretraining models in an outlier-free manner.
We show that such an approach leads to performance degradation in full precision.
We enhance the method by ensuring its normalization is invariant to sequence length.
- Score: 7.621880623381026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the growing size of large language models, the role of quantization
becomes increasingly significant. However, outliers present in weights or
activations notably influence the performance of quantized models. Recently,
\citet{qtransformer} introduced a novel softmax function aimed at pretraining
models in an outlier-free manner, thereby enhancing their suitability for
quantization. Interestingly, we observed that such an approach leads to
performance degradation in full precision. Building on this insight, we enhance
the method by ensuring its normalization is invariant to sequence length, a
crucial factor for bridging the gap between pretraining and fine-tuning.
Moreover, this improved method also facilitates successful pretraining of
causal language models.
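The abstract does not give the functional form of the outlier-free softmax, so the following is only a minimal sketch of the general family such methods belong to: adding a constant to the softmax denominator so that an attention row can sum to less than one, letting a head assign near-zero weight everywhere instead of dumping probability mass onto a few tokens (the mechanism commonly linked to outlier activations). The exact variant in \citet{qtransformer} and the length-invariant normalization of this paper may differ.

```python
import numpy as np

def softmax(x):
    # Standard softmax: every row sums to exactly 1.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax_plus_one(x):
    # "Off-by-one" style softmax: exp(x_i) / (1 + sum_j exp(x_j)),
    # computed stably by shifting with m = max(x, 0). Rows may sum
    # to less than 1, so a head can effectively abstain.
    m = np.maximum(x.max(axis=-1, keepdims=True), 0.0)
    e = np.exp(x - m)
    return e / (np.exp(-m) + e.sum(axis=-1, keepdims=True))
```

Note that the extra `1` in the denominator is not invariant to sequence length (the sum grows with the number of tokens); making the normalization length-invariant is precisely the enhancement this paper proposes, and its exact form is given in the paper itself.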
Related papers
- Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models [7.80071686970278]
Traditional Softmax attention suffers from numerical instability and reduced performance as the length of inference tokens increases.
This paper addresses these issues by decomposing the Softmax operation into a non-linear transformation and the $l_1$-norm.
We create a novel attention mechanism that outperforms conventional Softmax attention across various inference lengths.
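One plausible reading of the decomposition described above, sketched here as an assumption rather than the paper's exact formulation: replace the exponential with softplus and normalize each row by its $l_1$-norm (the paper's additional re-weighting step is omitted).

```python
import numpy as np

def softplus_l1_attention(scores):
    # Numerically stable softplus: log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|)).
    sp = np.maximum(scores, 0.0) + np.log1p(np.exp(-np.abs(scores)))
    # l1-normalization: weights are non-negative and each row sums to 1,
    # without the exponential growth that destabilizes softmax at long lengths.
    return sp / sp.sum(axis=-1, keepdims=True)
```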
arXiv Detail & Related papers (2025-01-23T07:21:08Z)
- Dissecting Bit-Level Scaling Laws in Quantizing Vision Generative Models [13.937690707239177]
We show that language-style models consistently outperform diffusion-style models across various quantization settings.
This observation suggests that language-style models have superior bit-level scaling laws, offering a better tradeoff between model quality and total bits.
We propose TopKLD to optimize the transfer of distilled knowledge by balancing "implicit knowledge" and "explicit knowledge" during the distillation process.
arXiv Detail & Related papers (2025-01-06T14:23:07Z)
- Scaling Laws for Precision [73.24325358259753]
We devise "precision-aware" scaling laws for both training and inference.
For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data.
For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions.
arXiv Detail & Related papers (2024-11-07T00:10:10Z)
- Exploring Quantization for Efficient Pre-Training of Transformer Language Models [11.696132057489786]
This study aims to explore the impact of quantization for efficient pre-training of Transformers.
By systematically applying straightforward linear quantization to weights, activations, gradients, and states, we assess its effects on model efficiency, stability, and performance during training.
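As context for "straightforward linear quantization", here is a minimal sketch of symmetric per-tensor uniform quantization; the study's exact scheme (per-tensor vs. per-channel, and how gradients and optimizer states are handled) may differ.

```python
import numpy as np

def linear_quantize(x, num_bits=8):
    # Symmetric uniform (linear) quantization: map real values onto a
    # signed integer grid, then map back (quantize-dequantize).
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for 8 bits
    scale = max(np.abs(x).max() / qmax, 1e-12)  # per-tensor scale, guard zeros
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale                          # dequantized approximation
```

The round-trip error per element is bounded by half a quantization step (`scale / 2`), which is the quantity such studies track against training stability.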
arXiv Detail & Related papers (2024-07-16T13:42:09Z)
- Scalable Ensembling For Mitigating Reward Overoptimisation [24.58937616758007]
Reinforcement Learning from Human Feedback has enabled significant advances in language modeling, producing powerful, instruction-following models.
The alignment of these models remains a pressing challenge, as the policy tends to overfit the learned "proxy" reward model past an inflection point of utility.
arXiv Detail & Related papers (2024-06-03T05:46:53Z)
- Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z)
- Mixtures of Experts Unlock Parameter Scaling for Deep RL [54.26191237981469]
In this paper, we demonstrate that incorporating Mixture-of-Experts (MoE) modules into value-based networks results in more parameter-scalable models.
This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
arXiv Detail & Related papers (2024-02-13T17:18:56Z)
- An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.