ElastiFormer: Learned Redundancy Reduction in Transformer via Self-Distillation
- URL: http://arxiv.org/abs/2411.15281v1
- Date: Fri, 22 Nov 2024 16:11:14 GMT
- Title: ElastiFormer: Learned Redundancy Reduction in Transformer via Self-Distillation
- Authors: Junzhang Liu, Tingkai Liu, Yueyuan Sui, Stephen Xia
- Abstract summary: ElastiFormer adapts pretrained Transformer models into an elastic counterpart with variable inference-time compute.
Routing modules are trained using self-distillation losses to minimize the differences between the output of the pretrained model and that of its elastic counterpart.
- Score: 0.6281017402518722
- Abstract: We introduce ElastiFormer, a post-training technique that adapts pretrained Transformer models into elastic counterparts with variable inference-time compute. ElastiFormer introduces small routing modules (as little as 0.00006% additional trainable parameters) to dynamically select the subsets of network parameters and input tokens to be processed by each layer of the pretrained network in an input-dependent manner. The routing modules are trained using self-distillation losses to minimize the differences between the output of the pretrained model and that of its elastic counterpart. As ElastiFormer makes no assumptions about the modality of the pretrained Transformer model, it can be readily applied to all modalities, covering causal language modeling, image modeling, and visual-language modeling tasks. We show that compute savings of 20% to 50% can be achieved for different components of the transformer layer, and that compute can be reduced further by adding very low-rank LoRA weights (rank 1) trained via the same distillation objective. Finally, by comparing routing modules trained on different subsets of ImageNet, we show that ElastiFormer is robust to the choice of training domain.
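The core mechanism is a small per-layer router trained against the frozen pretrained model as its own teacher. Below is a minimal PyTorch sketch of that idea; the names (TokenRouter, elastic_forward) are hypothetical, a soft sigmoid gate stands in for whatever routing parameterization the paper actually uses, and a toy MLP stands in for the pretrained layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenRouter(nn.Module):
    """Tiny per-layer router emitting a keep probability per token."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # dim + 1 parameters: tiny overhead

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) -> gate in [0, 1] per token
        return torch.sigmoid(self.score(x))

def elastic_forward(layer: nn.Module, router: TokenRouter, x: torch.Tensor):
    g = router(x)                        # soft gate keeps the router trainable
    # At inference one would threshold g and run `layer` only on kept tokens;
    # the soft blend here is for differentiable training.
    return g * layer(x) + (1.0 - g) * x

# Self-distillation: the frozen pretrained layer acts as its own teacher.
teacher = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
for p in teacher.parameters():
    p.requires_grad_(False)

router = TokenRouter(dim=64)
x = torch.randn(8, 16, 64)
loss = F.mse_loss(elastic_forward(teacher, router, x), teacher(x))
loss.backward()                          # only the router receives gradients
```

In practice a sparsity penalty on the gate values (e.g. adding a term proportional to the mean gate to the loss) would be needed to push the router toward actually skipping tokens and realizing the compute savings.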
Related papers
- Transferable Post-training via Inverse Value Learning [83.75002867411263]
We propose modeling changes at the logits level during post-training using a separate neural network (i.e., the value network).
After training this network on a small base model using demonstrations, it can be seamlessly integrated with other pre-trained models during inference.
We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes.
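As a rough illustration of the logits-level idea, the following hypothetical sketch trains a small network to predict the post-training logit change of a base model and then applies it to another model's logits at inference, assuming a shared vocabulary; none of the names or shapes come from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 32000   # assumed shared vocabulary size

class ValueNetwork(nn.Module):
    """Predicts a logit correction from a model's pre-trained logits."""
    def __init__(self, vocab: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab, hidden), nn.GELU(), nn.Linear(hidden, vocab)
        )

    def forward(self, base_logits: torch.Tensor) -> torch.Tensor:
        return self.net(base_logits)

value_net = ValueNetwork(VOCAB)

# Training: on a small base model, regress the logit change induced by
# post-training (stand-in tensors here instead of real model outputs).
small_base = torch.randn(4, VOCAB)
small_tuned = small_base + 0.1 * torch.randn(4, VOCAB)
loss = F.mse_loss(value_net(small_base), small_tuned - small_base)
loss.backward()

# Inference: the same value network corrects the logits of a *different*,
# larger pre-trained model that shares the vocabulary.
large_base = torch.randn(4, VOCAB)
adapted = large_base + value_net(large_base).detach()
```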
arXiv Detail & Related papers (2024-10-28T13:48:43Z)
- Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think [72.48325960659822]
One main bottleneck in training large-scale diffusion models for generation lies in effectively learning good internal representations.
We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders.
The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs.
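A hedged sketch of what such an alignment regularizer could look like in PyTorch; the projection head, feature shapes, and the choice of frozen encoder are assumptions rather than REPA's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim_denoiser, dim_encoder = 768, 1024        # assumed dimensions
proj = nn.Linear(dim_denoiser, dim_encoder)  # trainable projection head

def alignment_loss(denoiser_hidden, encoder_feats):
    # denoiser_hidden: (batch, tokens, dim_denoiser), taken from an
    #   intermediate layer of the denoising network on a *noisy* input
    # encoder_feats: (batch, tokens, dim_encoder), from a frozen pretrained
    #   visual encoder (e.g. DINOv2) on the corresponding *clean* image
    z = proj(denoiser_hidden)
    return -F.cosine_similarity(z, encoder_feats.detach(), dim=-1).mean()

h = torch.randn(2, 196, dim_denoiser)
feats = torch.randn(2, 196, dim_encoder)
loss = alignment_loss(h, feats)
# total_loss = diffusion_loss + lambda_align * loss   (lambda_align assumed)
loss.backward()
```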
arXiv Detail & Related papers (2024-10-09T14:34:53Z)
- Dynamic Layer Tying for Parameter-Efficient Transformers [65.268245109828]
We employ Reinforcement Learning to select layers during training and tie them together.
This facilitates weight sharing, reduces the number of trainable parameters, and also serves as an effective regularization technique.
In particular, the memory consumption during training is up to one order of magnitude less than the conventional training method.
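A toy illustration of the tying mechanics: layers share parameters according to a tying assignment, which in the paper is chosen by an RL controller during training; here it is sampled at random purely to show the weight sharing.

```python
import random
import torch
import torch.nn as nn

n_layers, dim = 12, 256
blocks = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
     for _ in range(n_layers)]
)

# Tying assignment: layer i runs with the weights of layer assignment[i]
# (assignment[i] <= i). An RL agent would pick these indices to maximize
# performance; random choice here is for illustration only.
assignment = [random.randint(0, i) for i in range(n_layers)]

def forward(x):
    for i in range(n_layers):
        x = blocks[assignment[i]](x)   # tied layers reuse shared parameters
    return x

y = forward(torch.randn(2, 10, dim))
print(f"{len(set(assignment))}/{n_layers} distinct parameter sets in use")
```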
arXiv Detail & Related papers (2024-01-23T14:53:20Z)
- Analyzing and Improving the Training Dynamics of Diffusion Models [36.37845647984578]
We identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture.
We find that systematically applying this design philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity.
arXiv Detail & Related papers (2023-12-05T11:55:47Z)
- Latent State Models of Training Dynamics [51.88132043461152]
We train models with different random seeds and compute a variety of metrics throughout training.
We then fit a hidden Markov model (HMM) over the resulting sequences of metrics.
We use the HMM representation to study phase transitions and identify latent "detour" states that slow down convergence.
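A sketch of this pipeline using the hmmlearn package; the logged metrics, number of runs, and number of hidden states are placeholders.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# Suppose we logged 3 metrics (e.g. loss, gradient norm, weight norm) at
# 200 training steps for 5 random seeds.
runs = [np.random.rand(200, 3) for _ in range(5)]

X = np.concatenate(runs)             # (5*200, 3) stacked observations
lengths = [len(r) for r in runs]     # sequence boundaries, one per run

hmm = GaussianHMM(n_components=4, covariance_type="diag", n_iter=50)
hmm.fit(X, lengths)

# Decode each run into a sequence of latent training phases; a state in
# which some runs linger before converging is a candidate "detour" state.
states = hmm.predict(runs[0])
print(states[:20])
```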
arXiv Detail & Related papers (2023-08-18T13:20:08Z)
- Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models, but it struggles with small models.
We introduce a one-stage solution that obtains pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
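A toy slimmable layer showing the weight-sharing mechanics: sub-networks at different widths slice one shared weight matrix. This illustrates the general idea, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableLinear(nn.Module):
    """One weight matrix shared by sub-networks of different widths."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x, width_ratio: float = 1.0):
        # A narrower sub-network uses only the first k output channels.
        k = max(1, int(self.weight.size(0) * width_ratio))
        return F.linear(x, self.weight[:k], self.bias[:k])

layer = SlimmableLinear(128, 512)
x = torch.randn(4, 128)
full = layer(x)                      # full network
small = layer(x, width_ratio=0.25)   # weight-sharing sub-network, 25% width
print(full.shape, small.shape)       # (4, 512) and (4, 128)
```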
arXiv Detail & Related papers (2022-09-30T15:15:05Z)
- Layer Reduction: Accelerating Conformer-Based Self-Supervised Model via Layer Consistency [31.572652956170252]
Transformer-based self-supervised models are trained as feature extractors and have empowered many downstream speech tasks to achieve state-of-the-art performance.
We experimentally achieve 7.8X parameter reduction, 41.9% training speedup and 37.7% inference speedup while maintaining comparable performance with conventional BERT-like self-supervised methods.
arXiv Detail & Related papers (2021-04-08T08:21:59Z)
- Learning to Learn Parameterized Classification Networks for Scalable Input Images [76.44375136492827]
Convolutional Neural Networks (CNNs) do not exhibit predictable recognition behavior under changes in input resolution.
We employ meta learners to generate convolutional weights of main networks for various input scales.
We further utilize knowledge distillation on the fly over model predictions based on different input resolutions.
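A hedged sketch combining both ideas: a hypothetical hypernetwork emits convolution weights conditioned on input scale, and low-resolution predictions are distilled from high-resolution ones on the fly. All names and shapes are invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightGenerator(nn.Module):
    """Meta learner: maps an input scale (a scalar) to a 3x3 conv kernel."""
    def __init__(self, in_ch: int = 3, out_ch: int = 16):
        super().__init__()
        self.out_shape = (out_ch, in_ch, 3, 3)
        self.net = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(),
            nn.Linear(64, out_ch * in_ch * 9),
        )

    def forward(self, scale: float) -> torch.Tensor:
        w = self.net(torch.tensor([[scale]]))
        return w.view(self.out_shape)

gen = WeightGenerator()

def predict(img, scale):
    feat = F.conv2d(img, gen(scale), padding=1)
    return feat.mean(dim=(2, 3))    # stand-in for a classification head

x_hi = torch.randn(2, 3, 224, 224)
x_lo = F.interpolate(x_hi, size=112)

p_hi = predict(x_hi, 1.0)
p_lo = predict(x_lo, 0.5)

# Knowledge distillation on the fly: low-resolution predictions track the
# high-resolution ones (simple MSE here; KL over softened logits is typical).
kd_loss = F.mse_loss(p_lo, p_hi.detach())
kd_loss.backward()
```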
arXiv Detail & Related papers (2020-07-13T04:27:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the content and is not responsible for any consequences of its use.