Learning Parameter Sharing with Tensor Decompositions and Sparsity
- URL: http://arxiv.org/abs/2411.09816v3
- Date: Sun, 23 Feb 2025 20:02:33 GMT
- Title: Learning Parameter Sharing with Tensor Decompositions and Sparsity
- Authors: Cem Üyük, Mike Lasby, Mohamed Yassin, Utku Evci, Yani Ioannou,
- Abstract summary: We introduce Fine-grained Singular Sharing (FiPS) to compress large-scale Vision Transformers (ViTs) and Large Language Models (LLMs)<n>FiPS employs a shared base and sparse factors to represent neurons across multi-layer perceptron (MLP) modules.<n> Experimental results show that FiPS reduces the parameter budget by 50-75% for DeiT-B and Swin-L and by 40-50% for various Gemma-2 and Llama-3 models.
- Score: 5.73573685846194
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large neural networks exhibit exceptional performance across numerous tasks, yet their considerable size often hinders deployment on resource-constrained systems. While various model compression strategies have been well studied, parameter sharing remains underexplored. In this paper, we introduce Fine-grained Parameter Sharing (FiPS), a novel algorithm that leverages parameter sharing, tensor decomposition, and sparsity to effectively compress large-scale Vision Transformers (ViTs) and Large Language Models (LLMs). FiPS employs a shared base and sparse factors to represent neurons across multi-layer perceptron (MLP) modules, where initialization is guided by Singular Value Decomposition (SVD) and subsequent optimization is conducted through block-wise reconstruction error minimization. Experimental results show that FiPS reduces the parameter budget of MLP modules by 50-75% for DeiT-B and Swin-L and by 40-50% for various Gemma-2 and Llama-3 models while maintaining ViT model accuracy within 1% pt. of the original and LLM perplexity with negligible degradation.
Related papers
- tCURLoRA: Tensor CUR Decomposition Based Low-Rank Parameter Adaptation and Its Application in Medical Image Segmentation [1.3281936946796913]
Transfer learning, by leveraging knowledge from pre-trained models, has significantly enhanced the performance of target tasks.
As deep neural networks scale up, full fine-tuning introduces substantial computational and storage challenges.
We propose tCURLoRA, a novel fine-tuning method based on tensor CUR decomposition.
arXiv Detail & Related papers (2025-01-04T08:25:32Z) - ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts.
Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z) - Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [56.00251589760559]
Large language models (LLMs) can act as gradient priors in a zero-shot setting.
We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding.
Experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods.
arXiv Detail & Related papers (2024-09-26T13:38:33Z) - Convolutional Neural Network Compression Based on Low-Rank Decomposition [3.3295360710329738]
This paper proposes a model compression method that integrates Variational Bayesian Matrix Factorization.
VBMF is employed to estimate the rank of the weight tensor at each layer.
Experimental results show that for both high and low compression ratios, our compression model exhibits advanced performance.
arXiv Detail & Related papers (2024-08-29T06:40:34Z) - MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces textbfModular bfDecomposition (MoDeGPT), a novel structured compression framework.
MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions.
Our experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - FoldGPT: Simple and Effective Large Language Model Compression Scheme [5.611544096046119]
Network bandwidth and memory limitations pose challenges for deploying billion-level models on mobile devices.
We propose FoldGPT, which combines block removal and block parameter sharing.
Experiments demonstrate that FoldGPT outperforms previous state-of-the-art(SOTA) methods in efficient model compression.
arXiv Detail & Related papers (2024-07-01T03:17:53Z) - MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z) - LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models [9.244526043014098]
Large language models (LLMs) show excellent performance in difficult tasks, but they often require massive memories and computational resources.
In this study, we make an important observation that the multi-head self-attention (MHA) sub-layer of Transformer exhibits noticeable low-rank structure.
We propose a mixed compression model, which organically combines Low-Rank matrix And structured Pruning (LoRAP)
arXiv Detail & Related papers (2024-04-15T11:53:22Z) - Memory-Efficient Vision Transformers: An Activation-Aware Mixed-Rank
Compression Strategy [5.699098817569033]
This paper introduces an activation-aware model compression methodology that uses selective low-rank weight tensor approximations of different layers to reduce the parameter count of ViTs.
The presented method significantly reduces the parameter count of DeiT-B by 60% with less than 1% accuracy drop on the ImageNet dataset.
In addition to this, the presented compression technique can compress large DeiT/ViT models to have about the same model size as smaller DeiT/ViT variants while yielding up to 1.8% accuracy gain.
arXiv Detail & Related papers (2024-02-08T19:01:14Z) - Parameter Efficient Fine-tuning via Cross Block Orchestration for Segment Anything Model [81.55141188169621]
We equip PEFT with a cross-block orchestration mechanism to enable the adaptation of the Segment Anything Model (SAM) to various downstream scenarios.
We propose an intra-block enhancement module, which introduces a linear projection head whose weights are generated from a hyper-complex layer.
Our proposed approach consistently improves the segmentation performance significantly on novel scenarios with only around 1K additional parameters.
arXiv Detail & Related papers (2023-11-28T11:23:34Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient
Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on matrix product operator (MPO)
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of
Large-Scale Pre-Trained Language Models [19.640997611256168]
We propose AlphaTuning, consisting of post-training quantization of the pre-trained language model and fine-tuning only some parts of quantized parameters for a target task.
Specifically, AlphaTuning works by employing binary-coding quantization, which factorizes the full-precision parameters into binary parameters and a separate set of scaling factors.
We demonstrate that AlphaTuning, when applied to GPT-2 and OPT, performs competitively with full fine-tuning on a variety of downstream tasks while achieving >10x compression ratio under 4-bit quantization and >1,000x reduction in the number of trainable parameters.
arXiv Detail & Related papers (2022-10-08T00:36:00Z) - Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for
End-to-End Speech Recognition [17.73449206184214]
This paper proposes a parameter-efficient conformer via sharing sparsely-gated experts.
Specifically, we use sparsely-gated mixture-of-experts (MoE) to extend the capacity of a conformer block without increasing.
arXiv Detail & Related papers (2022-09-17T13:22:19Z) - Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning in a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z) - Learning Robust and Lightweight Model through Separable Structured
Transformations [13.208781763887947]
We propose a separable structural transformation of the fully-connected layer to reduce the parameters of convolutional neural networks.
We successfully reduce the amount of network parameters by 90%, while the robust accuracy loss is less than 1.5%.
We evaluate the proposed approach on datasets such as ImageNet, SVHN, CIFAR-100 and Vision Transformer.
arXiv Detail & Related papers (2021-12-27T07:25:26Z) - Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criteria comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z) - Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned
Edge Learning Over Broadband Channels [69.18343801164741]
partitioned edge learning (PARTEL) implements parameter-server training, a well known distributed learning method, in wireless network.
We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z) - A Variational Information Bottleneck Based Method to Compress Sequential
Networks for Human Action Recognition [9.414818018857316]
We propose a method to effectively compress Recurrent Neural Networks (RNNs) used for Human Action Recognition (HAR)
We use a Variational Information Bottleneck (VIB) theory-based pruning approach to limit the information flow through the sequential cells of RNNs to a small subset.
We combine our pruning method with a specific group-lasso regularization technique that significantly improves compression.
It is shown that our method achieves over 70 times greater compression than the nearest competitor with comparable accuracy for the task of action recognition on UCF11.
arXiv Detail & Related papers (2020-10-03T12:41:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.