Block Pruning For Faster Transformers
- URL: http://arxiv.org/abs/2109.04838v1
- Date: Fri, 10 Sep 2021 12:46:32 GMT
- Title: Block Pruning For Faster Transformers
- Authors: François Lagunas, Ella Charlaix, Victor Sanh, Alexander M. Rush
- Abstract summary: We introduce a block pruning approach targeting both small and fast models.
We find that this approach learns to prune out full components of the underlying model, such as attention heads.
- Score: 89.70392810063247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training has improved model accuracy for both classification and
generation tasks at the cost of introducing much larger and slower models.
Pruning methods have proven to be an effective way of reducing model size,
whereas distillation methods have proven effective at speeding up inference. We introduce
a block pruning approach targeting both small and fast models. Our approach
extends structured methods by considering blocks of any size and integrates
this structure into the movement pruning paradigm for fine-tuning. We find that
this approach learns to prune out full components of the underlying model, such
as attention heads. Experiments consider classification and generation tasks,
yielding among other results a pruned model that is a 2.4x faster, 74% smaller
BERT on SQuAD v1, with a 1% drop on F1, competitive both with distilled models
in speed and pruned models in size.
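The mechanism is concrete enough to sketch. Below is a minimal PyTorch illustration of block-level movement pruning, not the authors' implementation: each square block of a weight matrix gets a learned score, the top-scoring blocks are kept through a straight-through mask, and fine-tuning drives unimportant blocks toward removal. The class name `BlockPrunedLinear` and parameters such as `block_size` and `keep_ratio` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockPrunedLinear(nn.Module):
    """Linear layer whose weight matrix is pruned in (B x B) blocks.

    A learned score per block is turned into a hard 0/1 mask with a
    straight-through estimator, so score gradients implement a
    movement-pruning-style criterion at the block level. Sketch only.
    """
    def __init__(self, in_features, out_features, block_size=32, keep_ratio=0.5):
        super().__init__()
        assert in_features % block_size == 0 and out_features % block_size == 0
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.block_size = block_size
        self.keep_ratio = keep_ratio
        n_blocks = (out_features // block_size) * (in_features // block_size)
        self.scores = nn.Parameter(torch.zeros(n_blocks))

    def block_mask(self):
        # Keep the top-k blocks by learned score; straight-through gradient.
        k = max(1, int(self.keep_ratio * self.scores.numel()))
        threshold = torch.topk(self.scores, k).values.min()
        hard = (self.scores >= threshold).float()
        return hard + self.scores - self.scores.detach()

    def forward(self, x):
        B = self.block_size
        out_b, in_b = self.weight.shape[0] // B, self.weight.shape[1] // B
        mask = self.block_mask().view(out_b, in_b)
        # Expand the per-block mask to the full weight shape.
        full_mask = mask.repeat_interleave(B, dim=0).repeat_interleave(B, dim=1)
        return F.linear(x, self.weight * full_mask, self.bias)
```

When the block size lines up with the parameter slab of a single attention head, dropping a block drops the head outright, which is consistent with the abstract's observation that whole components such as attention heads get pruned.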
Related papers
- MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router [55.88046193872355]
Mixture-of-Experts (MoE) architectures face challenges such as high memory consumption and redundancy in experts.
We propose MoE-Pruner, a method that prunes weights with the smallest magnitudes multiplied by the corresponding input activations and router weights.
Our pruning method is one-shot, requiring no retraining or weight updates.
arXiv Detail & Related papers (2024-10-15T19:22:27Z)
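The stated criterion is simple enough to sketch. The following is a hedged, minimal version assuming one expert's weight matrix `W`, a calibration batch `X` of activations routed to that expert, and a scalar `router_weight` for the expert's gate; these names are hypothetical and this is not the paper's code.

```python
import torch

def moe_pruner_mask(W, X, router_weight, sparsity=0.5):
    """One-shot mask for one expert's weight matrix (no retraining).

    Score each weight by |magnitude| x input-activation norm x the
    router weight attached to this expert, then zero the lowest scores.
    W: (out, in) expert weights; X: (tokens, in) calibration activations
    routed to this expert; router_weight: scalar gate strength. Sketch.
    """
    act_norm = X.norm(p=2, dim=0)                # per-input-feature L2 norm
    scores = W.abs() * act_norm * router_weight  # broadcast over rows
    k = max(1, int(sparsity * scores.numel()))
    threshold = scores.flatten().kthvalue(k).values
    return (scores > threshold).float()

# Hypothetical usage: W_pruned = W * moe_pruner_mask(W, X, gate, 0.5)
```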
- SlimFlow: Training Smaller One-Step Diffusion Models with Rectified Flow [24.213303324584906]
We develop small, efficient one-step diffusion models based on the powerful rectified flow framework.
We train a one-step diffusion model with an FID of 5.02 and 15.7M parameters, outperforming the previous state-of-the-art one-step diffusion model.
arXiv Detail & Related papers (2024-07-17T16:38:45Z)
- T-Stitch: Accelerating Sampling in Pre-Trained Diffusion Models with Trajectory Stitching [143.72720563387082]
Trajectory Stitching (T-Stitch) is a simple yet efficient technique to improve sampling efficiency with little or no generation degradation.
Our key insight is that different diffusion models learn similar encodings under the same training data distribution.
Our method can also be used as a drop-in technique to accelerate the popular pretrained stable diffusion (SD) models.
arXiv Detail & Related papers (2024-02-21T23:08:54Z)
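The stitching schedule itself fits in a few lines. A minimal sketch, assuming two interchangeable denoisers `small` and `large` with the same calling convention and a solver update `step_fn`; it illustrates the idea rather than the released T-Stitch code.

```python
import torch

@torch.no_grad()
def stitched_sampling(small, large, x_T, timesteps, switch_frac=0.4, step_fn=None):
    """Denoise with a cheap model early, a large model late.

    Early steps shape global structure, where the two models' outputs
    are similar, so the small model is 'stitched' in there; the large
    model refines details in the remaining steps. `step_fn` applies one
    solver update given (x, t, predicted_noise). Sketch only.
    """
    x = x_T
    n_small = int(switch_frac * len(timesteps))
    for i, t in enumerate(timesteps):
        model = small if i < n_small else large
        eps = model(x, t)          # predicted noise at this step
        x = step_fn(x, t, eps)     # one DDIM/DDPM-style update
    return x
```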
- A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization [54.113083217869516]
In this work, we first explore the computationally redundant parts of the network.
We then prune the redundant blocks of the model while maintaining network performance.
Third, we propose a global-regional interactive (GRI) attention mechanism to speed up the computationally intensive attention component.
arXiv Detail & Related papers (2023-12-24T15:37:47Z)
- Federated Topic Model and Model Pruning Based on Variational Autoencoder [14.737942599204064]
Federated topic modeling allows multiple parties to jointly train models while protecting data privacy.
This paper proposes a method to establish a federated topic model while ensuring the privacy of each node, and uses neural network model pruning to accelerate the model.
Experimental results show that federated topic model pruning greatly accelerates model training while preserving the model's performance.
arXiv Detail & Related papers (2023-11-01T06:00:14Z)
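As a rough sketch of the loop described (federated aggregation for privacy, plus pruning for speed), here is a FedAvg-style server round with simple magnitude pruning. The paper's VAE-based topic model and its exact pruning schedule are not reproduced; every name below is an assumption.

```python
import torch

def fedavg_with_pruning(global_state, client_states, prune_frac=0.2):
    """One server round: average client updates, then prune small weights.

    Averaging keeps raw documents on the clients (privacy), and pruning
    shrinks the model broadcast for the next round, which is where the
    training speed-up comes from. Illustrative sketch, not the paper's code.
    """
    averaged = {}
    for name in global_state:
        stacked = torch.stack([c[name].float() for c in client_states])
        avg = stacked.mean(dim=0)
        if avg.dim() >= 2:  # prune only weight matrices, not biases
            k = max(1, int(prune_frac * avg.numel()))
            threshold = avg.abs().flatten().kthvalue(k).values
            avg = avg * (avg.abs() > threshold).float()
        averaged[name] = avg
    return averaged
```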
- Finding the SWEET Spot: Analysis and Improvement of Adaptive Inference in Low Resource Settings [6.463202903076821]
We compare the two main approaches for adaptive inference, Early-Exit and Multi-Model, when training data is limited.
Early-Exit provides a better speed-accuracy trade-off due to the overhead of the Multi-Model approach.
We propose SWEET, an Early-Exit fine-tuning method that assigns each classifier its own set of unique model weights.
arXiv Detail & Related papers (2023-06-04T09:16:39Z)
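Early-Exit inference itself is easy to sketch: attach a classifier after several intermediate layers and stop as soon as one is confident. The code below is a generic confidence-threshold early exit for batch size 1, not the SWEET fine-tuning procedure, whose contribution lies in how the per-classifier weights are trained.

```python
import torch

@torch.no_grad()
def early_exit_predict(layers, classifiers, x, confidence=0.9):
    """Run layers sequentially; return early once a head is confident.

    `layers` and `classifiers` are parallel lists: classifiers[i] reads
    the hidden state after layers[i]. Easy inputs exit early, hard
    inputs pay for the full stack: the speed/accuracy trade-off. Sketch.
    """
    h = x
    for layer, head in zip(layers, classifiers):
        h = layer(h)
        probs = torch.softmax(head(h), dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= confidence:   # assumes batch size 1
            return pred
    return pred  # fall through: last classifier's prediction
```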
- Structured Pruning Learns Compact and Accurate Models [28.54826400747667]
We propose CoFi (Coarse- and Fine-grained Pruning), a task-specific structured pruning method.
CoFi delivers highly parallelizable subnetworks and matches distillation methods in both accuracy and latency.
Our experiments on GLUE and SQuAD datasets show that CoFi yields models with over 10x speedups with a small accuracy drop.
arXiv Detail & Related papers (2022-04-01T13:09:56Z)
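The coarse-plus-fine idea can be illustrated by composing masks at two granularities on a single attention projection. A schematic sketch, assuming learned head-level and dimension-level masks; CoFi's joint training and distillation objective are not shown.

```python
import torch

def cofi_style_mask(W, head_mask, dim_mask, n_heads):
    """Compose a coarse (per-head) and fine (per-dimension) mask.

    W: (hidden, hidden) attention output projection.
    head_mask: (n_heads,) 0/1 coarse mask; dim_mask: (hidden,) fine mask.
    Zeroed heads remove whole structured slices, so the pruned model
    stays dense and parallelizable on standard hardware. Sketch only.
    """
    hidden = W.shape[1]
    head_dim = hidden // n_heads
    coarse = head_mask.repeat_interleave(head_dim)   # (hidden,)
    return W * (coarse * dim_mask).unsqueeze(0)      # mask input columns
```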
- Movement Pruning: Adaptive Sparsity by Fine-Tuning [115.91907953454034]
Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning.
We propose the use of movement pruning, a simple, deterministic first-order weight pruning method.
Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes.
arXiv Detail & Related papers (2020-05-15T17:54:15Z)
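Movement pruning's first-order criterion is compact: a weight's importance accumulates the negative product of the weight and its gradient, so weights moving away from zero are kept. A minimal sketch of the score accumulation, not the paper's full straight-through implementation:

```python
import torch

def update_movement_scores(scores, weight):
    """Accumulate movement-pruning importance after loss.backward().

    A weight 'moving away from zero' has gradient and value with
    opposite signs, so -grad * weight is positive and its score grows;
    weights shrinking toward zero accumulate negative scores and are
    pruned. Sketch only.
    """
    with torch.no_grad():
        scores -= weight.grad * weight
    return scores

# At pruning time, keep the weights with the largest accumulated scores,
# e.g. for 90% sparsity:
# k = int(0.9 * scores.numel())
# mask = (scores >= scores.flatten().kthvalue(k).values).float()
```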
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.