Tending Towards Stability: Convergence Challenges in Small Language Models
- URL: http://arxiv.org/abs/2410.11451v1
- Date: Tue, 15 Oct 2024 09:57:19 GMT
- Title: Tending Towards Stability: Convergence Challenges in Small Language Models
- Authors: Richard Diehl Martinez, Pietro Lesci, Paula Buttery
- Abstract summary: Despite their advantages, smaller models frequently underperform compared to their larger counterparts.
This is anecdotally attributed to their reduced representational capacity.
By linking the convergence of layers' activations to their parameters' effective rank, our analyses can guide future work to address inefficiencies in the learning dynamics of small models.
- Score: 3.734405405403176
- Abstract: Increasing the number of parameters in language models is a common strategy to enhance their performance. However, smaller language models remain valuable due to their lower operational costs. Despite their advantages, smaller models frequently underperform compared to their larger counterparts, even when provided with equivalent data and computational resources. Specifically, their performance tends to degrade in the late pretraining phase. This is anecdotally attributed to their reduced representational capacity. Yet, the exact causes of this performance degradation remain unclear. We use the Pythia model suite to analyse the training dynamics that underlie this phenomenon. Across different model sizes, we investigate the convergence of the Attention and MLP activations to their final state and examine how the effective rank of their parameters influences this process. We find that nearly all layers in larger models stabilise early in training - within the first 20% - whereas layers in smaller models exhibit slower and less stable convergence, especially when their parameters have lower effective rank. By linking the convergence of layers' activations to their parameters' effective rank, our analyses can guide future work to address inefficiencies in the learning dynamics of small models.
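The analysis rests on two quantities: how close a layer's activations are to their end-of-training state, and the effective rank of that layer's parameters. The sketch below shows one way such quantities can be computed for a Transformer weight matrix; the entropy-based effective rank and the cosine-similarity convergence proxy are illustrative assumptions, not necessarily the exact metrics used in the paper.

```python
# Minimal sketch (assumed metrics, not the paper's released code): entropy-based
# effective rank of a weight matrix and a simple proxy for how far a layer's
# activations are from their final-checkpoint state.
import torch

def effective_rank(weight: torch.Tensor, eps: float = 1e-12) -> float:
    """exp(entropy) of the normalised singular-value distribution of `weight`."""
    s = torch.linalg.svdvals(weight.float())   # singular values
    p = s / (s.sum() + eps)                    # normalise into a distribution
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy).item()

def activation_convergence(act_t: torch.Tensor, act_final: torch.Tensor) -> float:
    """Cosine similarity between a layer's activations at step t and at the final
    checkpoint, computed on the same evaluation batch (1.0 = fully converged)."""
    return torch.nn.functional.cosine_similarity(
        act_t.flatten(), act_final.flatten(), dim=0
    ).item()

# Hypothetical usage with a Pythia checkpoint loaded via Hugging Face transformers:
#   W = model.gpt_neox.layers[6].mlp.dense_h_to_4h.weight
#   print(effective_rank(W))
```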
Related papers
- SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z) - Effects of Scale on Language Model Robustness [7.725206196110384]
We show that adversarially trained larger models generalize faster and better to modified attacks not seen during training when compared with smaller models.
We also analyze the offense/defense balance of increasing compute, finding parity in some settings and an advantage for offense in others.
arXiv Detail & Related papers (2024-07-25T17:26:41Z) - Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z) - Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z) - Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck [11.416426888383873]
We find that smaller models can suffer from saturation, characterized as a drop in performance at some advanced point in training followed by a plateau.
This can be explained by a mismatch between the hidden dimension of smaller models and the high rank of the target contextual probability distribution.
We measure the effect of the softmax bottleneck in various settings and find that models with a hidden dimension smaller than 1000 tend to adopt degenerate latent representations in late pretraining (a toy illustration of this rank cap is sketched after this list).
arXiv Detail & Related papers (2024-04-11T11:10:36Z) - Small-scale proxies for large-scale Transformer training instabilities [69.36381318171338]
We seek ways to reproduce and study training stability and instability at smaller scales.
By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models when training at high learning rates.
We study methods such as warm-up, weight decay, and the $\mu$Param to train small models that achieve similar losses across orders of magnitude of learning rate variation.
arXiv Detail & Related papers (2023-09-25T17:48:51Z) - Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale [5.759319006531332]
We show the benefits of pre-training with masked language modeling (MLM) objective in models as small as 1.25M parameters.
We examine downscaling effects, extending scaling laws to models as small as 1M parameters.
arXiv Detail & Related papers (2023-05-26T21:22:10Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - Same Pre-training Loss, Better Downstream: Implicit Bias Matters for
Language Models [46.24479693469042]
This paper shows that 1) pre-training loss cannot fully explain downstream performance and 2) flatness of the model is well-correlated with downstream performance where pre-training loss is not.
arXiv Detail & Related papers (2022-10-25T17:45:36Z) - Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight-parameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
arXiv Detail & Related papers (2021-10-01T10:03:57Z) - On the Effect of Dropping Layers of Pre-trained Transformer Models [35.25025837133909]
We explore strategies to drop layers in pre-trained models, and observe the effect of pruning on downstream GLUE tasks.
We were able to prune BERT, RoBERTa and XLNet models up to 40%, while maintaining up to 98% of their original performance.
Our experiments yield interesting observations, such as: (i) the lower layers are most critical for maintaining downstream task performance, (ii) some tasks, such as paraphrase detection and sentence similarity, are more robust to layer dropping, and (iii) models trained with different objective functions exhibit different learning patterns with respect to layer dropping.
arXiv Detail & Related papers (2020-04-08T07:09:59Z)