Related papers: FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers

FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers

URL: http://arxiv.org/abs/2411.14507v2
Date: Sat, 24 May 2025 04:19:51 GMT
Title: FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers
Authors: Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu,
Abstract summary: Generative Pre-trained Transformers (GPTs) have demonstrated remarkable performance across diverse domains.<n>Recent works have observed redundancy within transformer blocks and developed compression methods by structured pruning of less important blocks.<n>We propose FuseGPT, a novel methodology designed to recycle pruned transformer blocks, thereby recovering the model's performance.
Score: 30.88764351013966
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generative Pre-trained Transformers (GPTs) have demonstrated remarkable performance across diverse domains, largely due to the extensive scaling of model parameters. Recent works have observed redundancy within transformer blocks and developed compression methods by structured pruning of less important blocks. However, such direct removal often leads to irreversible performance degradation. In this paper, we propose FuseGPT, a novel methodology designed to recycle pruned transformer blocks, thereby recovering the model's performance. Firstly, we introduce a new importance detection metric, Macro Influence (MI), which evaluates the long-term impact of each transformer block by quantifying the information loss incurred upon its removal. Next, we propose group-level layer fusion, which leverages the parameters from layers of less important blocks and integrates them into the corresponding layers of neighboring blocks. This fusion process is not a one-time operation but is refined through iterative parameter updates by lightweight group-level fine-tuning. Specifically, the injected parameters are frozen but are weighted with learnable rank decomposition matrices to reduce the computational overhead during fine-tuning. Our approach not only works well for large language models but also for large multimodal models. Experimental results indicate that, even with modest amounts of data, FuseGPT surpasses previous methods in both perplexity and zero-shot task performance.

Related papers

FlattenGPT: Depth Compression for Transformer with Layer Flattening [89.7587433008809]
textbfFlattenGPT is a novel way to detect and reduce depth-wise redundancies.<n>Experiments demonstrate that FlattenGPT enhances model efficiency with a decent trade-off to performance.
arXiv Detail & Related papers (2026-02-09T16:22:58Z)
Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation [41.16959587963631]
We transform a dense Diffusion Transformer (DiT) into a Mixture of Experts (MoE) for structured sparsification.<n>Specifically, we replace the Feed-Forward Networks (FFNs) in DiT Blocks with MoE layers, reducing the number of activated parameters in the FFNs by 62.5%.<n>Overall, Dense2MoE establishes a new paradigm for efficient text-to-image generation.
arXiv Detail & Related papers (2025-10-10T07:42:27Z)
ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts. Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
TinyFusion: Diffusion Transformers Learned Shallow [52.96232442322824]
Diffusion Transformers have demonstrated remarkable capabilities in image generation but often come with excessive parameterization. We present TinyFusion, a depth pruning method designed to remove redundant layers from diffusion transformers via end-to-end learning. Experiments with DiT-XL show that TinyFusion can craft a shallow diffusion transformer at less than 7% of the pre-training cost, achieving a 2$times$ speedup with an FID score of 2.86.
arXiv Detail & Related papers (2024-12-02T07:05:39Z)
Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA [38.30350849992281]
"Recursive" language models share parameters across layers with minimal loss of performance. Recursive Transformers are efficiently from standard pretrained Transformers, but only use a single block of unique layers that is then repeated multiple times in a loop. We show that our models outperform both similar-sized vanilla pretrained models and knowledge distillation baselines.
arXiv Detail & Related papers (2024-10-28T02:15:45Z)
SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction. SMILE allows for the upscaling of source models into an MoE model without extra data or further training. We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z)
MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces textbfModular bfDecomposition (MoDeGPT), a novel structured compression framework.<n>MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions.<n>Our experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z)
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios. In the early route, intermediate outputs are consolidated via an anti-redundancy operation. In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations. Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality. No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
Few-Shot Class Incremental Learning via Robust Transformer Approach [16.590193619691416]
Few-Shot Class-Incremental Learning presents an extension of the Class Incremental Learning problem where a model is faced with the problem of data scarcity. This problem remains an open problem because all recent works are built upon the convolutional neural networks performing sub-optimally. Our paper presents Robust Transformer Approach built upon the Compact Convolution Transformer.
arXiv Detail & Related papers (2024-05-08T03:35:52Z)
MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection [54.545054873239295]
Deepfakes have recently raised significant trust issues and security concerns among the public. ViT-based methods take advantage of the expressivity of transformers, achieving superior detection performance. This work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet parameter-efficient ViT-based approach.
arXiv Detail & Related papers (2024-04-12T13:02:08Z)
Transformer Fusion with Optimal Transport [25.022849817421964]
Fusion is a technique for merging multiple independently-trained neural networks in order to combine their capabilities. This paper presents a systematic approach for fusing two or more transformer-based networks exploiting Optimal Transport to (soft-)align the various architectural components.
arXiv Detail & Related papers (2023-10-09T13:40:31Z)
Make Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning [6.451743797015637]
We propose memory-efficient fine-tuning (MEFT) for pre-trained language models. MEFT inserts adapters into a PLM, preserving the PLM's starting point and making it reversible without additional pre-training. MEFT significantly reduces the activation memory up to 84% of full fine-tuning with a negligible amount of trainable parameters.
arXiv Detail & Related papers (2023-06-01T09:26:17Z)
Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformers-based models. We prove that eliminating the MASK token and considering the whole output during the loss are essential choices to improve performance.
arXiv Detail & Related papers (2021-04-20T00:09:37Z)
Of Non-Linearity and Commutativity in BERT [8.295319152986316]
We study the interactions between layers in BERT and show that, while the layers exhibit some hierarchical structure, they extract features in a fuzzy manner. Our results suggest that BERT has an inductive bias towards layer commutativity, which we find is mainly due to the skip connections.
arXiv Detail & Related papers (2021-01-12T15:29:38Z)
The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers. When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.