Crown, Frame, Reverse: Layer-Wise Scaling Variants for LLM Pre-Training
- URL: http://arxiv.org/abs/2509.06518v1
- Date: Mon, 08 Sep 2025 10:24:19 GMT
- Title: Crown, Frame, Reverse: Layer-Wise Scaling Variants for LLM Pre-Training
- Authors: Andrei Baroian, Kasper Notebomer
- Abstract summary: Transformer-based language models traditionally use uniform (isotropic) layer sizes, an approach that ignores the diverse functional roles different depths can play and their differing computational capacity needs. We introduce three new LWS variants - Framed, Reverse, and Crown - that redistribute FFN widths and attention heads via two- or three-point linear interpolation in the pre-training stage. We present the first systematic ablation of LWS and its variants on a fixed budget of 180M parameters, trained on 5B tokens. All models converge to similar losses and achieve better performance than an equal-cost isotropic baseline, without a substantial decrease in training throughput.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Transformer-based language models traditionally use uniform (isotropic) layer sizes, an approach that ignores the diverse functional roles different depths can play and their differing computational capacity needs. Building on Layer-Wise Scaling (LWS) and the pruning literature, we introduce three new LWS variants - Framed, Reverse, and Crown - that redistribute FFN widths and attention heads via two- or three-point linear interpolation in the pre-training stage. We present the first systematic ablation of LWS and its variants on a fixed budget of 180M parameters, trained on 5B tokens. All models converge to similar losses and achieve better performance than an equal-cost isotropic baseline, without a substantial decrease in training throughput. This work represents an initial step into the design space of layer-wise architectures for pre-training, but future work should scale experiments to orders of magnitude more tokens and parameters to fully assess their potential.
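The core mechanism, redistributing per-layer capacity by piecewise-linear interpolation across depth, can be sketched in a few lines of Python. The snippet below is a minimal illustration under stated assumptions rather than the paper's implementation: the anchor placements chosen for Reverse, Crown, and Framed, the 4x d_model FFN baseline, and the helper name lws_layer_sizes are hypothetical choices made for the example.

```python
import numpy as np

def lws_layer_sizes(num_layers, anchors, d_model=768, head_dim=64):
    """Sketch of layer-wise scaling via piecewise-linear interpolation.

    anchors: (depth_fraction, scale) control points, e.g. a two-point
    ramp [(0.0, 0.5), (1.0, 1.5)] or a three-point "crown"
    [(0.0, 0.5), (0.5, 1.5), (1.0, 0.5)].
    Returns per-layer FFN widths and attention-head counts.
    """
    xs, ys = zip(*anchors)
    depth = np.linspace(0.0, 1.0, num_layers)      # relative depth of each layer
    scale = np.interp(depth, xs, ys)               # linear interpolation between anchor points
    ffn_widths = np.rint(scale * 4 * d_model).astype(int)
    n_heads = np.maximum(1, np.rint(scale * d_model / head_dim)).astype(int)
    return ffn_widths, n_heads

# Illustrative anchor choices (not the paper's exact definitions):
reverse = lws_layer_sizes(12, [(0.0, 1.5), (1.0, 0.5)])              # wide early, narrow late
crown   = lws_layer_sizes(12, [(0.0, 0.5), (0.5, 1.5), (1.0, 0.5)])  # widest at mid-depth
framed  = lws_layer_sizes(12, [(0.0, 1.5), (0.5, 0.5), (1.0, 1.5)])  # wide at both ends
```

To respect a fixed budget such as the paper's 180M parameters, widths produced this way would presumably be rounded to hardware-friendly multiples and rescaled so the total parameter count matches the equal-cost isotropic baseline.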
Related papers
- Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation [10.804106052326402]
Adapting large pre-trained models to unseen tasks under tight data and compute budgets remains challenging. We propose BOLT, a framework that reuses existing fine-tuned models and adapts within that subspace. Our results show that constraining adaptation to a task-informed subspace provides an effective alternative for unseen-task transfer.
arXiv Detail & Related papers (2025-12-02T06:00:16Z) - RSQ: Learning from Important Tokens Leads to Better Quantized LLMs [65.5558181902098]
Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. We propose RSQ (Rotate, Scale, then Quantize), which applies rotations to the model to mitigate outliers. We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families.
arXiv Detail & Related papers (2025-03-03T18:46:33Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers [16.253898272659242]
State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive.
Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFNs).
We show that wide and structured networks can utilize training FLOPs more efficiently, with fewer parameters and lower loss than dense models at their optimal trade-off.
arXiv Detail & Related papers (2024-06-24T08:43:21Z) - Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
arXiv Detail & Related papers (2023-10-10T15:13:30Z) - The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on matrix product operator (MPO)
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - Equivariant Architectures for Learning in Deep Weight Spaces [54.61765488960555]
We present a novel network architecture for learning in deep weight spaces.
It takes as input a concatenation of weights and biases of a pre-trained variant.
We show how these layers can be implemented using three basic operations.
arXiv Detail & Related papers (2023-01-30T10:50:33Z) - Benchmarking down-scaled (not so large) pre-trained language models [0.0]
Large Transformer-based language models are pre-trained on corpora of varying sizes, for a different number of steps and with different batch sizes.
We compare three pre-training objectives for different shape parameters and model sizes, while also varying the number of pre-training steps and the batch size.
In our experiments, the NSP + BERT-style objective consistently outperforms the RoBERTa-style objective as well as the standard LM objective.
arXiv Detail & Related papers (2021-05-11T09:01:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.