Modular Transformers: Compressing Transformers into Modularized Layers
for Flexible Efficient Inference
- URL: http://arxiv.org/abs/2306.02379v1
- Date: Sun, 4 Jun 2023 15:26:28 GMT
- Title: Modular Transformers: Compressing Transformers into Modularized Layers
for Flexible Efficient Inference
- Authors: Wangchunshu Zhou, Ronan Le Bras, Yejin Choi
- Abstract summary: We introduce Modular Transformers, a modularized encoder-decoder framework for flexible sequence-to-sequence model compression.
After a single training phase, Modular Transformers can achieve flexible compression ratios from 1.1x to 6x with little to moderate relative performance drop.
- Score: 83.01121484432801
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained Transformer models like T5 and BART have advanced the state of
the art on a wide range of text generation tasks. Compressing these models into
smaller ones has become critically important for practical use. Common neural
network compression techniques such as knowledge distillation or quantization
are limited to static compression where the compression ratio is fixed. In this
paper, we introduce Modular Transformers, a modularized encoder-decoder
framework for flexible sequence-to-sequence model compression. Modular
Transformers train modularized layers that have the same function as two or
more consecutive layers in the original model via module replacing and
knowledge distillation. After training, the modularized layers can be flexibly
assembled into sequence-to-sequence models that meet different
performance-efficiency trade-offs. Experimental results show that after a
single training phase, by simply varying the assembling strategy, Modular
Transformers can achieve flexible compression ratios from 1.1x to 6x with
little to moderate relative performance drop.
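To make the training recipe concrete, the sketch below shows module replacing with knowledge distillation in PyTorch: a single module is optimized to reproduce the output of a span of frozen consecutive teacher layers. This is a minimal illustration under assumed names (`CompressedModule`, `distill_step`) and a plain MSE objective, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CompressedModule(nn.Module):
    """A single modularized layer trained to stand in for a span of
    consecutive teacher layers (illustrative sketch only)."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, x):
        return self.layer(x)

def distill_step(teacher_layers, module, x, optimizer):
    """One training step: run the frozen teacher span, then update the
    module to reproduce the span's output (simple MSE distillation)."""
    with torch.no_grad():
        target = x
        for layer in teacher_layers:      # e.g. two consecutive frozen layers
            target = layer(target)
    loss = nn.functional.mse_loss(module(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After training modules for several span lengths, an assembly that covers the original depth with fewer modules yields the desired compression ratio, e.g. six two-layer modules for roughly 2x, or two six-layer modules for roughly 6x.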
Related papers
- ElastiFormer: Learned Redundancy Reduction in Transformer via Self-Distillation [0.6281017402518722]
ElastiFormer adapts pretrained Transformer models into an elastic counterpart with variable inference-time compute.
Routing modules are trained with self-distillation losses that minimize the difference between the outputs of the pretrained model and its elastic counterpart.
arXiv Detail & Related papers (2024-11-22T16:11:14Z)
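A minimal sketch of the self-distillation idea described above, assuming per-token routing over a frozen pretrained layer; the class and loss names are hypothetical, and the real ElastiFormer routing is more elaborate.

```python
import torch
import torch.nn as nn

class RoutedLayer(nn.Module):
    """Wraps a frozen pretrained layer with a trainable router that can
    softly skip compute per token (hypothetical sketch)."""
    def __init__(self, layer: nn.Module, d_model: int):
        super().__init__()
        self.layer = layer                    # frozen pretrained layer
        self.router = nn.Linear(d_model, 1)  # trainable routing module

    def forward(self, x):
        gate = torch.sigmoid(self.router(x))          # per-token keep probability
        return gate * self.layer(x) + (1 - gate) * x  # soft skip via residual

def self_distill_loss(pretrained_model, elastic_model, x):
    """Train only the routers so the elastic model's output matches the
    frozen original's output."""
    with torch.no_grad():
        target = pretrained_model(x)
    return nn.functional.mse_loss(elastic_model(x), target)
```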
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that distills a pretrained Transformer architecture into alternative architectures such as state space models (SSMs).
Our method, called MOHAWK, distills a Mamba-2 variant based on the Phi-1.5 architecture using only 3B tokens, and a hybrid version (Hybrid Phi-Mamba) using 5B tokens.
Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
arXiv Detail & Related papers (2024-08-19T17:48:11Z)
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
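The layer-sharing idea is easy to state in code: one set of layer weights is applied repeatedly across depth, so parameters shrink roughly by the depth factor. Below is a generic Universal-Transformer-style sketch, not MoEUT itself, which (per its title) additionally replaces the shared feed-forward blocks with mixtures of experts to recover capacity.

```python
import torch.nn as nn

class SharedLayerTransformer(nn.Module):
    """One Transformer layer reused at every depth step (layer sharing)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, depth: int = 12):
        super().__init__()
        self.shared = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):  # same parameters at each step
            x = self.shared(x)
        return x
```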
- Is Modularity Transferable? A Case Study through the Lens of Knowledge Distillation [59.37775534633868]
We present an extremely straightforward approach to transferring pre-trained, task-specific PEFT modules between same-family PLMs.
We also propose a method that allows the transfer of modules between incompatible PLMs without any change in the inference complexity.
arXiv Detail & Related papers (2024-03-27T17:50:00Z)
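If the "extremely straightforward approach" amounts to reusing a module's weights in a dimensionally compatible sibling model, it might look like the hypothetical helper below; this is an assumption for illustration, not the paper's verified procedure.

```python
def transfer_peft_module(src_model, dst_model, module_name: str):
    """Copy a task-specific PEFT module (e.g. a LoRA adapter) from one
    pretrained model into a same-family model with matching shapes.
    Hypothetical helper; the paper's actual procedure may differ."""
    src = dict(src_model.named_modules())[module_name]
    dst = dict(dst_model.named_modules())[module_name]
    dst.load_state_dict(src.state_dict())  # succeeds when shapes line up
```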
- USDC: Unified Static and Dynamic Compression for Visual Transformer [17.10536016262485]
Visual Transformers have achieved great success in almost all vision tasks, such as classification and detection.
However, the model complexity and inference speed of visual Transformers hinder their deployment in industrial products.
Various model compression techniques focus on directly compressing a visual Transformer into a smaller one while maintaining performance; however, performance drops dramatically when the compression ratio is large.
Several dynamic network techniques have also been applied to compress visual Transformers into input-adaptive efficient sub-structures at inference time, which can achieve a better trade-off between compression ratio and model performance.
arXiv Detail & Related papers (2023-10-17T10:04:47Z)
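Dynamic compression of the kind described above can be illustrated with a generic early-exit gate that spends less depth on easy inputs; this sketch is a stand-in for the idea, not USDC's specific method.

```python
import torch
import torch.nn as nn

class DynamicDepthEncoder(nn.Module):
    """Input-adaptive depth: a tiny gate decides after each block whether
    the representation is confident enough to stop early (generic sketch)."""
    def __init__(self, blocks, d_model: int, threshold: float = 0.9):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.gates = nn.ModuleList(nn.Linear(d_model, 1) for _ in blocks)
        self.threshold = threshold

    def forward(self, x):                     # x: (batch, tokens, d_model)
        for block, gate in zip(self.blocks, self.gates):
            x = block(x)
            confidence = torch.sigmoid(gate(x.mean(dim=1))).mean()
            if confidence > self.threshold:   # exit early on easy inputs
                break
        return x
```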
- Transformer Fusion with Optimal Transport [25.022849817421964]
Fusion is a technique for merging multiple independently trained neural networks in order to combine their capabilities.
This paper presents a systematic approach for fusing two or more Transformer-based networks by exploiting Optimal Transport to (soft-)align their various architectural components.
arXiv Detail & Related papers (2023-10-09T13:40:31Z)
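The soft alignment can be pictured with entropy-regularized optimal transport (Sinkhorn iterations) on a single weight matrix: neurons of one model are softly permuted to match the other before averaging. This is a toy sketch; the paper handles full Transformer components, and its cost design will differ.

```python
import torch

def sinkhorn(cost, n_iters: int = 20, eps: float = 0.1):
    """Entropy-regularized OT: soft assignment between two neuron sets
    given a pairwise cost matrix."""
    K = torch.exp(-cost / eps)
    u = torch.ones(cost.shape[0])
    v = torch.ones(cost.shape[1])
    for _ in range(n_iters):      # alternate row/column normalizations
        u = 1.0 / (K @ v)
        v = 1.0 / (K.T @ u)
    return u[:, None] * K * v[None, :]

def fuse_linear(w_a, w_b):
    """Soft-align model B's neurons to model A's, then average the weights."""
    cost = torch.cdist(w_a, w_b)  # distances between output neurons
    plan = sinkhorn(cost)         # rows sum to ~1: a soft permutation
    return 0.5 * (w_a + plan @ w_b)
```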
- Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding [12.030179065286928]
The paper proposes a quantization-aware tensor-compressed training approach to reduce the model size, arithmetic operations, and runtime latency of Transformer-based models.
A layer-by-layer distillation is applied to distill a quantized and tensor-compressed student model from a pre-trained transformer.
The performance is demonstrated on two natural language understanding tasks, showing up to $63\times$ compression with little accuracy loss and remarkable inference and training speedup.
arXiv Detail & Related papers (2023-06-01T18:32:08Z)
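A compact picture of the two ingredients, fake-quantized weights with a straight-through gradient and a per-layer distillation loss, is sketched below; the tensor-compression (low-rank factorization) part is omitted, and all names are illustrative.

```python
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Module):
    """Quantization-aware linear layer: weights are rounded to an int8 grid
    in the forward pass, with a straight-through gradient estimate."""
    def __init__(self, in_features: int, out_features: int, n_bits: int = 8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.qmax = 2 ** (n_bits - 1) - 1

    def forward(self, x):
        scale = self.weight.abs().max() / self.qmax
        q = torch.round(self.weight / scale).clamp(-self.qmax, self.qmax) * scale
        w = self.weight + (q - self.weight).detach()  # straight-through estimator
        return nn.functional.linear(x, w)

def layerwise_distill_loss(teacher_layer, student_layer, x):
    """Layer-by-layer distillation: match the student layer's output to the
    frozen teacher layer's output on the same input."""
    with torch.no_grad():
        target = teacher_layer(x)
    return nn.functional.mse_loss(student_layer(x), target)
```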
- UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers [108.92194081987967]
We make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing a single architecture to fit different tasks.
Unlike previous RNN-based models, we utilize a transformer-based model to generate a flexible policy.
The proposed model, named Universal Policy Decoupling Transformer (UPDeT), further relaxes the action restriction and makes the multi-agent task's decision process more explainable.
arXiv Detail & Related papers (2021-01-20T07:24:24Z)
- Parameter Efficient Multimodal Transformers for Video Representation Learning [108.8517364784009]
This work focuses on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning.
We show that our approach reduces parameters by up to 80%, allowing us to train our model end-to-end from scratch.
To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
arXiv Detail & Related papers (2020-12-08T00:16:13Z)
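One common route to this kind of parameter reduction is sharing a single encoder across modalities while keeping lightweight modality-specific projections; the sketch below illustrates that generic idea with assumed input dimensions, not the paper's exact sharing and factorization scheme.

```python
import torch.nn as nn

class SharedMultimodalEncoder(nn.Module):
    """One Transformer encoder reused for both audio and visual streams;
    only the input projections are modality-specific (generic sketch)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.audio_proj = nn.Linear(128, d_model)   # assumed feature sizes
        self.video_proj = nn.Linear(1024, d_model)

    def forward(self, audio, video):
        a = self.encoder(self.audio_proj(audio))  # same weights for both
        v = self.encoder(self.video_proj(video))
        return a, v
```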