Transformer as Linear Expansion of Learngene
- URL: http://arxiv.org/abs/2312.05614v2
- Date: Wed, 20 Dec 2023 05:59:10 GMT
- Title: Transformer as Linear Expansion of Learngene
- Authors: Shiyu Xia, Miaosen Zhang, Xu Yang, Ruiming Chen, Haokun Chen, Xin Geng
- Abstract summary: Linear Expansion of learnGene (TLEG) is a novel approach for flexibly producing and initializing Transformers of diverse depths.
Experiments on ImageNet-1K demonstrate that TLEG achieves comparable or better performance than many individual models trained from scratch.
- Score: 38.16612771203953
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose expanding the shared Transformer module to produce and initialize
Transformers of varying depths, enabling adaptation to diverse resource
constraints. Drawing an analogy to genetic expansibility, we term such a
module the learngene. To identify the expansion mechanism, we examine the
relationship between a layer's position and its corresponding weight values,
and find that a linear function appropriately approximates this relationship.
Building on this insight, we present Transformer as Linear Expansion of
learnGene (TLEG), a novel approach for flexibly producing and initializing
Transformers of diverse depths. Specifically, to learn the learngene, we first
construct an auxiliary Transformer linearly expanded from the learngene and
train it with soft distillation. Subsequently, we can produce and initialize
Transformers of varying depths by linearly expanding the well-trained
learngene, thereby supporting diverse downstream scenarios.
Extensive experiments on ImageNet-1K demonstrate that TLEG achieves comparable
or better performance than many individual models trained from scratch, while
reducing training cost by around 2x. When transferring to several
downstream classification datasets, TLEG surpasses existing initialization
methods by a large margin (e.g., +6.87% on iNat 2019 and +7.66% on CIFAR-100).
When producing models of varying depths to adapt to different resource
constraints, TLEG achieves comparable results while reducing the parameters
stored to initialize these models by around 19x and pre-training costs by
around 5x, compared with the pre-training and fine-tuning approach. When
transferring a fixed set of parameters to initialize different models, TLEG
offers greater flexibility and competitive performance while reducing the
parameters stored for initialization by around 2.9x, compared with the
pre-training approach.
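To make the expansion mechanism concrete, below is a minimal sketch of linearly expanding a learngene into Transformer encoder layers of arbitrary depth. It assumes a position-wise linear parameterization of the form theta_l = theta_alpha + (l / L) * theta_beta for the l-th of L layers; this exact parameterization, and all function and variable names, are illustrative assumptions rather than the paper's own code.

```python
import copy

import torch
import torch.nn as nn


def expand_learngene(layer_template: nn.Module,
                     theta_alpha: dict,
                     theta_beta: dict,
                     depth: int) -> nn.ModuleList:
    """Build `depth` Transformer layers whose weights vary linearly with position:
    theta_l = theta_alpha + (l / depth) * theta_beta (assumed form)."""
    layers = nn.ModuleList()
    for l in range(1, depth + 1):
        layer = copy.deepcopy(layer_template)
        layer.load_state_dict({
            name: theta_alpha[name] + (l / depth) * theta_beta[name]
            for name in theta_alpha
        })
        layers.append(layer)
    return layers


# One learngene (theta_alpha, theta_beta) can initialize models of any depth.
# theta_beta is zero-initialized here purely for illustration; in TLEG both
# components are learned by distilling an auxiliary, linearly expanded Transformer.
template = nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True)
theta_alpha = {k: v.detach().clone() for k, v in template.state_dict().items()}
theta_beta = {k: torch.zeros_like(v) for k, v in theta_alpha.items()}

shallow_layers = expand_learngene(template, theta_alpha, theta_beta, depth=6)
deep_layers = expand_learngene(template, theta_alpha, theta_beta, depth=12)
```

Because only (theta_alpha, theta_beta) need to be stored, the same learngene can initialize networks of any target depth, which is the source of the storage savings quoted above.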
Related papers
- Learning and Transferring Sparse Contextual Bigrams with Linear Transformers [47.37256334633102]
We introduce the Sparse Contextual Bigram (SCB) model, where the next token's generation depends on a sparse set of earlier positions determined by the last token.
We analyze the training dynamics and sample complexity of learning SCB using a one-layer linear transformer with a gradient-based algorithm.
We prove that, provided a nontrivial correlation between the downstream and pretraining tasks, finetuning from a pretrained model allows us to bypass the initial sample-intensive stage.
arXiv Detail & Related papers (2024-10-30T20:29:10Z) - Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show fast convergence of gradient flow on the regression loss despite the non-convexity of the loss landscape.
This is the first theoretical analysis of multi-layer Transformers in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z) - Linearizing Large Language Models [26.94551511277412]
We present a method to uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget.
We find that our linearization technique leads to competitive performance on standard benchmarks, but we identify persistent in-context learning and long-context modeling shortfalls for even the largest linear models.
arXiv Detail & Related papers (2024-05-10T17:59:08Z) - Tangent Transformers for Composition, Privacy and Removal [58.280295030852194]
Tangent Attention Fine-Tuning (TAFT) is a method for fine-tuning linearized transformers.
arXiv Detail & Related papers (2023-07-16T18:31:25Z) - Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z) - Semantic-visual Guided Transformer for Few-shot Class-incremental Learning [6.300141694311465]
We develop a semantic-visual guided Transformer (SV-T) to enhance the feature extracting capacity of the pre-trained feature backbone on incremental classes.
Our SV-T can take full advantage of more supervision information from base classes and further enhance the training robustness of the feature backbone.
arXiv Detail & Related papers (2023-03-27T15:06:49Z) - Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers by learning a linear map from the parameters of a smaller model to an initialization of the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch (a rough sketch of this linear mapping follows the list below).
arXiv Detail & Related papers (2023-03-02T05:21:18Z) - Decision Transformer: Reinforcement Learning via Sequence Modeling [102.86873656751489]
We present a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem.
We present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling.
Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
arXiv Detail & Related papers (2021-06-02T17:53:39Z)
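As a point of comparison with TLEG's fixed linear expansion, the LiGO entry above learns the linear map itself; the sketch referenced there is given below. It is a rough, simplified illustration under assumed shapes: a width operator and a depth operator jointly map the flattened per-layer parameters of a small pretrained model to an initialization for a larger one. The class and parameter names are hypothetical, and the real LiGO uses a more structured factorization.

```python
import torch
import torch.nn as nn


class LinearGrowthOperator(nn.Module):
    """Toy linear growth operator: widens each layer's flattened parameters and
    mixes small-model layers to form the layers of a deeper, wider model."""

    def __init__(self, small_dim: int, large_dim: int,
                 small_depth: int, large_depth: int):
        super().__init__()
        # Width operator: maps each layer's flattened weights to the larger width.
        self.width = nn.Linear(small_dim, large_dim, bias=False)
        # Depth operator: each large-model layer is a learned mix of small-model layers.
        self.depth = nn.Parameter(torch.randn(large_depth, small_depth) / small_depth)

    def forward(self, small_layer_params: torch.Tensor) -> torch.Tensor:
        # small_layer_params: (small_depth, small_dim) flattened per-layer weights.
        widened = self.width(small_layer_params)   # (small_depth, large_dim)
        return self.depth @ widened                # (large_depth, large_dim)


# Usage with toy sizes: grow a 6-layer model into an initialization for a
# 12-layer, wider model.
small_params = torch.randn(6, 512)
grow = LinearGrowthOperator(small_dim=512, large_dim=1024,
                            small_depth=6, large_depth=12)
large_init = grow(small_params)   # shape: (12, 1024)
```

Unlike TLEG, which fixes the expansion to a linear function of layer position, this kind of operator is itself trained, which is what distinguishes the two initialization strategies.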
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences of its use.