Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
- URL: http://arxiv.org/abs/2507.07129v2
- Date: Tue, 04 Nov 2025 18:17:08 GMT
- Title: Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
- Authors: A. Bochkov
- Abstract summary: The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training. This paper explores an alternative, constructive scaling paradigm, enabled by the principle of emergent semantics in Transformers. We operationalize this with a layer-wise constructive methodology that combines strict layer freezing in early stages with efficient, holistic fine-tuning of the entire model stack.
- Score: 1.0152838128195467
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive scaling paradigm, enabled by the principle of emergent semantics in Transformers with frozen, non-semantic input embeddings. We posit that because high-level meaning is a compositional property of a Transformer's deep layers, not its input vectors, the embedding layer and trained lower layers can serve as a fixed foundation. This liberates backpropagation to focus solely on newly added components, making incremental growth viable. We operationalize this with a layer-wise constructive methodology that combines strict layer freezing in early stages with efficient, holistic fine-tuning of the entire model stack via low-rank adaptation (LoRA) as complexity increases. This method not only demonstrates stable convergence but also reveals a direct correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD, which are absent in shallower models. In a controlled study, our constructively grown model rivals the performance of a monolithically trained baseline of the same size, validating the efficiency and efficacy of the approach. Our findings suggest a path towards a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development. This opens a path for more resource-efficient scaling, continual learning, and a more modular approach to building powerful AI systems. We release all code and models to facilitate further research.
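The two mechanisms the abstract names, strict freezing of the embedding and lower layers while a new layer trains, followed by holistic low-rank fine-tuning, fit in a short sketch. The PyTorch code below is a minimal illustration under assumed details; the layer type, sizes, growth schedule, and the hand-rolled LoRALinear are all ours, not the authors' released implementation.

```python
# Minimal sketch of constructive, layer-wise growth on a frozen substrate.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze full-rank weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: delta starts at 0

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

class GrowingTransformer(nn.Module):
    """A stack that starts empty and is deepened one frozen stage at a time."""
    def __init__(self, vocab_size=32000, d_model=512, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.embed.weight.requires_grad = False           # frozen, non-semantic embeddings
        self.layers = nn.ModuleList()
        self.head = nn.Linear(d_model, vocab_size)        # kept trainable across stages
        self.d_model, self.n_heads = d_model, n_heads

    def grow(self):
        """Freeze all layers trained so far, then append one fresh trainable layer."""
        for p in self.layers.parameters():
            p.requires_grad = False
        self.layers.append(nn.TransformerEncoderLayer(
            self.d_model, self.n_heads, batch_first=True))

    def forward(self, tokens):                            # causal masking omitted for brevity
        h = self.embed(tokens)
        for layer in self.layers:
            h = layer(h)
        return self.head(h)

model = GrowingTransformer()
for stage in range(4):                                    # four growth stages, four layers
    model.grow()
    trainable = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(trainable, lr=3e-4)
    # ... per-stage next-token training loop would go here ...

# Later "holistic" phase: wrap projections with LoRA and tune only the adapters.
model.head = LoRALinear(model.head, rank=8)
```

The paper's holistic phase applies LoRA across the entire stack; wrapping only the head here keeps the sketch short.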
Related papers
- Theoretical Foundations of Scaling Law in Familial Models [46.506708373314375]
We introduce Granularity (G) as a fundamental scaling variable alongside model size (N) and training tokens (D). Our results reveal that the granularity penalty follows a multiplicative power law with an extremely small exponent. Practically, it validates the "train once, deploy many" paradigm, demonstrating that deployment flexibility is achievable.
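For concreteness, one multiplicative form consistent with that description is shown below; this is an illustrative guess at the shape, not the paper's fitted law, and A, B, E, alpha, beta, gamma are assumed constants.

```latex
% Illustrative only: a Chinchilla-style loss in N and D multiplied by a
% granularity penalty whose exponent \gamma is extremely small.
L(N, D, G) = \left( \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + E \right) G^{\gamma},
\qquad 0 < \gamma \ll 1 .
```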
arXiv Detail & Related papers (2025-12-29T12:01:58Z)
- Scalable Class-Incremental Learning Based on Parametric Neural Collapse [5.550140856551579]
Incremental learning often encounters challenges such as overfitting to new data and catastrophic forgetting of old data. We propose scalable class-incremental learning based on parametric neural collapse.
arXiv Detail & Related papers (2025-12-26T03:34:59Z)
- Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs [49.66344956133349]
Reasoning capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a latent variable for strategic contextualization.
arXiv Detail & Related papers (2025-12-19T03:32:53Z)
- Learning Like Humans: Resource-Efficient Federated Fine-Tuning through Cognitive Developmental Stages [10.715400075722004]
Federated fine-tuning enables Large Language Models (LLMs) to adapt to downstream tasks while preserving data privacy. We introduce Developmental Federated Tuning (DevFT), a resource-efficient approach inspired by cognitive development. DevFT decomposes the fine-tuning process into developmental stages, each optimizing submodels with increasing parameter capacity.
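A minimal sketch of that staging idea, under the assumption that each stage fine-tunes a front-truncated submodel of a fixed layer stack (the paper's actual stage construction may differ):

```python
# Hypothetical DevFT-style stage schedule: stage k trains only the first
# k * layers_per_stage layers, so early stages are cheap and later stages
# refine progressively larger submodels.
import torch.nn as nn

def set_stage(layers: nn.ModuleList, stage: int, layers_per_stage: int = 4):
    """Enable gradients only for the first `stage * layers_per_stage` layers."""
    cutoff = stage * layers_per_stage
    for i, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad = i < cutoff
```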
arXiv Detail & Related papers (2025-07-31T09:36:43Z)
- Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows [46.673228292287895]
We propose a novel framework that employs transformer-based autoregressive normalizing flows to model continuous representations. This approach unlocks substantial flexibility, enabling the construction of models that can capture global bi-directional context. We propose new mixture-based coupling transformations designed to capture complex dependencies within the latent space shaped by discrete data.
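As background, one affine autoregressive flow step built on a causal transformer looks like the sketch below; this is the textbook construction, while the paper's mixture-based couplings are a more expressive replacement for the affine map. All shapes and sizes here are assumptions.

```python
# One affine autoregressive flow step: a causal transformer predicts a
# per-position shift and scale from strictly earlier positions.
import torch
import torch.nn as nn

class AffineARFlowStep(nn.Module):
    def __init__(self, dim=64, n_heads=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.net = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.to_params = nn.Linear(dim, 2 * dim)
        self.start = nn.Parameter(torch.zeros(1, 1, dim))   # stands in for "no context"

    def forward(self, x):                                   # x: (batch, seq, dim)
        B, T, D = x.shape
        ctx = torch.cat([self.start.expand(B, 1, D), x[:, :-1]], dim=1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.net(ctx, mask=causal)
        mu, log_sigma = self.to_params(h).chunk(2, dim=-1)
        z = (x - mu) * torch.exp(-log_sigma)                # invertible per position
        log_det = -log_sigma.sum(dim=(1, 2))                # log |det dz/dx|
        return z, log_det
```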
arXiv Detail & Related papers (2025-07-01T04:51:25Z)
- Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution [88.20464308588889]
We propose a Structural Similarity-Inspired Unfolding (SSIU) method for efficient image SR. This method is designed through unfolding an SR optimization function constrained by structural similarity. Our model outperforms current state-of-the-art models, boasting lower parameter counts and reduced memory consumption.
arXiv Detail & Related papers (2025-06-13T14:29:40Z)
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
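LESA learns how to synthesize the new layers; for contrast, the common non-learnable baseline it improves on is plain layer duplication, sketched below (assumed, simplified):

```python
# Naive depth scaling-up baseline: initialize a deeper stack by duplicating
# each pretrained layer in place. LESA replaces this with learned predictions
# of the inserted layers' parameters.
import copy
import torch.nn as nn

def depth_up_by_duplication(layers: nn.ModuleList, factor: int = 2) -> nn.ModuleList:
    """Return a stack `factor` times deeper, each block copied from its source layer."""
    grown = []
    for layer in layers:
        for _ in range(factor):
            grown.append(copy.deepcopy(layer))
    return nn.ModuleList(grown)
```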
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework [2.187990941788468]
DiM-Gesture is a generative model crafted to create highly personalized 3D full-body gestures solely from raw speech audio.
The model integrates a Mamba-based fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture.
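Adaptive Layer Normalization is the one standard building block here worth making concrete. A minimal sketch, with the conditioning vector standing in for the extracted audio features and the Mamba-2 backbone out of scope:

```python
# AdaLN: layer normalization whose scale and shift are predicted from a
# conditioning signal rather than learned as fixed parameters.
import torch.nn as nn

class AdaLN(nn.Module):
    def __init__(self, d_model: int, d_cond: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(d_cond, 2 * d_model)

    def forward(self, x, cond):          # x: (B, T, d_model), cond: (B, d_cond)
        scale, shift = self.to_scale_shift(cond).unsqueeze(1).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale) + shift
```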
arXiv Detail & Related papers (2024-08-01T08:22:47Z)
- Mamba-FSCIL: Dynamic Adaptation with Selective State Space Model for Few-Shot Class-Incremental Learning [115.79349923044663]
Few-shot class-incremental learning (FSCIL) aims to incrementally learn novel classes from limited examples. Existing methods face a critical dilemma: static architectures rely on a fixed parameter space to learn from data that arrive sequentially, and are prone to overfitting to the current session. In this study, we explore the potential of Selective State Space Models (SSMs) for FSCIL.
arXiv Detail & Related papers (2024-07-08T17:09:39Z)
- DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs [46.443316184807145]
We introduce Dynamic Layer Operations (DLO), a novel approach for vertically scaling transformer-based Large Language Models (LLMs).
Unlike traditional Mixture-of-Experts (MoE) methods that focus on extending the model width, our approach targets model depth, addressing the redundancy observed across layer representations for various input samples.
Experimental results demonstrate that DLO not only outperforms the original unscaled models but also achieves comparable results to densely expanded models with significantly improved efficiency.
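The general mechanism behind such depth-wise dynamic computation can be sketched with a per-sample gate (assumed mechanics, not DLO's exact routing):

```python
# A layer wrapped with a skip gate: a scalar score per sample decides whether
# the layer runs or the residual path passes the input through unchanged.
import torch
import torch.nn as nn

class GatedLayer(nn.Module):
    def __init__(self, layer: nn.Module, d_model: int):
        super().__init__()
        self.layer = layer
        self.gate = nn.Linear(d_model, 1)        # scores the layer's usefulness

    def forward(self, x):                        # x: (batch, seq, d_model)
        score = torch.sigmoid(self.gate(x.mean(dim=1)))     # (batch, 1)
        keep = (score > 0.5).float().view(-1, 1, 1)         # hard skip decision
        # Training end-to-end would need a soft gate or a straight-through trick.
        return keep * self.layer(x) + (1 - keep) * x
```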
arXiv Detail & Related papers (2024-07-03T18:34:08Z)
- Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling [4.190836962132713]
This paper introduces Orchid, a novel architecture designed to address the quadratic complexity of traditional attention mechanisms.
At the core of this architecture lies a new data-dependent global convolution layer, which contextually adapts its kernel conditioned on the input sequence.
We evaluate the proposed model across multiple domains, including language modeling and image classification, to highlight its performance and generality.
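The named core, a data-dependent global convolution, reduces to a short FFT sketch; Orchid's actual kernel-conditioning network and causality handling are more elaborate than this assumed version.

```python
# Data-dependent global (circular) convolution in O(L log L): the kernel is
# itself generated from the input, then applied per channel via FFT.
import torch
import torch.nn as nn

class DataDependentGlobalConv(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.to_kernel = nn.Linear(d_model, d_model)   # input-conditioned kernel

    def forward(self, x):                              # x: (batch, seq, d_model)
        kernel = self.to_kernel(x)                     # (batch, seq, d_model)
        Xf = torch.fft.rfft(x, dim=1)
        Kf = torch.fft.rfft(kernel, dim=1)
        return torch.fft.irfft(Xf * Kf, n=x.shape[1], dim=1)
```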
arXiv Detail & Related papers (2024-02-28T17:36:45Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
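The sharing scheme can be pictured with a simplified stand-in for the MPO factorization (structure assumed for illustration): every layer reuses one central matrix and adds only a small per-layer correction, so depth contributes few new parameters.

```python
# Cross-layer sharing of a "central" weight with cheap per-layer factors.
# This low-rank split is a stand-in for the paper's MPO (tensor-train) form.
import torch
import torch.nn as nn

class SharedCenterLinear(nn.Module):
    def __init__(self, center: nn.Parameter, dim: int, rank: int = 8):
        super().__init__()
        self.center = center                                       # shared by all layers
        self.left = nn.Parameter(torch.randn(dim, rank) * 0.01)    # per-layer
        self.right = nn.Parameter(torch.randn(rank, dim) * 0.01)   # per-layer

    def forward(self, x):
        W = self.center + self.left @ self.right                   # layer-specific weight
        return x @ W.T

dim = 512
center = nn.Parameter(torch.randn(dim, dim) * 0.02)
layers = nn.ModuleList(SharedCenterLinear(center, dim) for _ in range(24))
```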
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
- FOSTER: Feature Boosting and Compression for Class-Incremental Learning [52.603520403933985]
Deep neural networks suffer from catastrophic forgetting when learning new categories.
We propose a novel two-stage learning paradigm FOSTER, empowering the model to learn new categories adaptively.
arXiv Detail & Related papers (2022-04-10T11:38:33Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by Transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
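The stated modification maps onto a small block: a depthwise convolution branch next to self-attention, so convolution covers local interactions and attention covers global ones. Layout and sizes below are assumptions, not GroupBERT's exact layer.

```python
# Transformer block with a convolutional module complementing self-attention.
import torch.nn as nn

class ConvAttnBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, kernel_size=7):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)  # depthwise
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                          # x: (batch, seq, d_model)
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)                  # global interactions
        x = x + a
        c = self.conv(self.norm2(x).transpose(1, 2)).transpose(1, 2)
        return x + c                               # local interactions
```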
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
- Learning Deep-Latent Hierarchies by Stacking Wasserstein Autoencoders [22.54887526392739]
We propose a novel approach to training models with deep-latent hierarchies based on Optimal Transport.
We show that our method enables the generative model to fully leverage its deep-latent hierarchy, avoiding the well-known "latent variable collapse" issue of VAEs.
arXiv Detail & Related papers (2020-10-07T15:04:20Z)
- S2RMs: Spatially Structured Recurrent Modules [105.0377129434636]
We take a step towards exploiting dynamic structures that are capable of simultaneously exploiting both modular and temporal structures.
We find our models to be robust to the number of available views and better capable of generalization to novel tasks without additional training.
arXiv Detail & Related papers (2020-07-13T17:44:30Z)
- Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems [83.98774574197613]
We take one of the simplest inference methods, truncated max-product belief propagation, and add what is necessary to make it a proper component of a deep learning model.
This BP-Layer can be used as the final or an intermediate block in convolutional neural networks (CNNs).
The model is applicable to a range of dense prediction problems, is well-trainable and provides parameter-efficient and robust solutions in stereo, optical flow and semantic segmentation.
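As background for the building block, one forward sweep of max-product belief propagation on a 1-D chain is a few lines (textbook version in log space; the paper's layer adds learned potentials, truncation, and the complementary sweeps, so treat this as an assumed simplification):

```python
# Forward sweep of max-product BP (Viterbi-style) on a chain, in log space.
import torch

def max_product_forward(unary, pairwise):
    """unary: (T, K) log-potentials per position; pairwise: (K, K) transition log-potentials."""
    T, K = unary.shape
    msg = torch.zeros(T, K)                               # messages m_{t-1 -> t}
    for t in range(1, T):
        prev = (unary[t - 1] + msg[t - 1]).unsqueeze(1)   # (K, 1): score of each predecessor
        msg[t] = (prev + pairwise).max(dim=0).values      # best predecessor per state
    return unary + msg   # forward beliefs; a backward sweep is needed for max-marginals
```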
arXiv Detail & Related papers (2020-03-13T13:11:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.