In defense of parameter sharing for model-compression
- URL: http://arxiv.org/abs/2310.11611v1
- Date: Tue, 17 Oct 2023 22:08:01 GMT
- Title: In defense of parameter sharing for model-compression
- Authors: Aditya Desai, Anshumali Shrivastava
- Abstract summary: Randomized parameter-sharing (RPS) methods have gained traction for model compression at the start of training.
RPS consistently outperforms or matches smaller models and all moderately informed pruning strategies.
This paper argues in favor of a paradigm shift towards RPS-based models.
- Score: 38.80110838121722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When considering a model architecture, there are several ways to reduce its
memory footprint. Historically, popular approaches included selecting smaller
architectures and creating sparse networks through pruning. More recently,
randomized parameter-sharing (RPS) methods have gained traction for model
compression at the start of training. In this paper, we comprehensively assess the
trade-off between memory and accuracy across RPS, pruning techniques, and
building smaller models. Our findings demonstrate that RPS, which is both data
and model-agnostic, consistently outperforms/matches smaller models and all
moderately informed pruning strategies, such as MAG, SNIP, SYNFLOW, and GRASP,
across the entire compression range. This advantage becomes particularly
pronounced in higher compression scenarios. Notably, even when compared to
highly informed pruning techniques like Lottery Ticket Rewinding (LTR), RPS
exhibits superior performance in high compression settings. This points to an
inherent capacity advantage that RPS enjoys over sparse models. Theoretically,
we establish RPS as a superior technique in terms of memory-efficient
representation when compared to pruning for linear models. This paper argues in
favor of a paradigm shift towards RPS-based models. During our rigorous
evaluation of RPS, we identified issues in the state-of-the-art RPS technique
ROAST, specifically regarding stability (ROAST's sensitivity to initialization
hyperparameters, often leading to divergence) and Pareto-continuity (ROAST's
inability to recover the accuracy of the original model at zero compression).
We provably address both of these issues. We refer to the modified RPS, which
incorporates our improvements, as STABLE-RPS.
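As a rough, minimal NumPy sketch of the randomized parameter-sharing idea (not the paper's ROAST/STABLE-RPS implementation; the function names and sizes below are illustrative assumptions): every virtual weight of a layer is hashed into a much smaller shared parameter array and given a random sign, so the stored footprint is set by the shared array rather than by the layer's nominal size.

    import numpy as np

    def make_rps_layer(n_virtual, n_shared, seed=0):
        """Back n_virtual weights with a shared array of only n_shared parameters."""
        rng = np.random.default_rng(seed)
        shared = rng.standard_normal(n_shared) * 0.01    # the only stored parameters
        idx = rng.integers(0, n_shared, size=n_virtual)  # hash: virtual slot -> shared slot
        sign = rng.choice([-1.0, 1.0], size=n_virtual)   # random signs decorrelate collisions
        return shared, idx, sign

    def recover_weights(shared, idx, sign, shape):
        """Materialize the virtual weight tensor on the fly; it is never stored."""
        return (sign * shared[idx]).reshape(shape)

    # A 512x512 "virtual" layer backed by 4096 shared parameters (~64x compression).
    shared, idx, sign = make_rps_layer(512 * 512, 4096)
    W = recover_weights(shared, idx, sign, (512, 512))
    y = W @ np.ones(512)
    print(W.shape, shared.size)

During training, gradients of all virtual weights that collide in a shared slot accumulate there, which is one reason initialization and scaling choices matter for stability, as the abstract notes for ROAST.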
Related papers
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
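For intuition only, here is a minimal NumPy sketch of replacing a projection weight with two thin low-rank factors via truncated SVD (the general mechanism behind low-rank KV-weight compression; the rank, sizes, and function name are illustrative assumptions, not LoRC's progressive strategy):

    import numpy as np

    def low_rank_factors(W, rank):
        """Thin factors A (d_out x r) and B (r x d_in) with A @ B approximating W."""
        U, S, Vt = np.linalg.svd(W, full_matrices=False)
        A = U[:, :rank] * S[:rank]   # absorb singular values into the left factor
        B = Vt[:rank, :]
        return A, B

    rng = np.random.default_rng(0)
    W_k = rng.standard_normal((512, 512))      # stand-in for a key-projection weight
    A, B = low_rank_factors(W_k, rank=64)
    x = rng.standard_normal(512)
    err = np.linalg.norm(W_k @ x - A @ (B @ x)) / np.linalg.norm(W_k @ x)
    # For a random matrix err is large; this kind of compression pays off when the
    # trained weights are approximately low-rank.
    print(A.shape, B.shape, round(err, 3))

With factors like these, the low-dimensional intermediate B @ x is the natural object to work with in place of the full key/value vectors, which is the kind of memory saving such methods target.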
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism [25.36736897890854]
Pre-trained language models (PLMs) are engineered to be robust in contextual understanding and exhibit outstanding performance in various natural language processing tasks.
Modern pruning strategies employ one-shot techniques to compress PLMs without the need for retraining on task-specific or otherwise general data.
We propose SDS, a Sparse-Dense-Sparse pruning framework to enhance the performance of the pruned PLMs from a weight distribution optimization perspective.
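As context for what "one-shot" pruning means here, a minimal NumPy sketch of magnitude pruning without retraining is shown below (the starting point such frameworks refine; this is not the SDS framework itself, and the function name and sparsity level are illustrative):

    import numpy as np

    def magnitude_prune(W, sparsity):
        """Zero out the smallest-magnitude fraction of weights in one shot."""
        k = int(W.size * sparsity)
        if k == 0:
            return W.copy(), np.ones_like(W, dtype=bool)
        threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
        mask = np.abs(W) > threshold     # keep only weights above the cutoff
        return W * mask, mask

    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 256))
    W_sparse, mask = magnitude_prune(W, sparsity=0.9)
    print(f"kept {mask.mean():.1%} of the weights")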
arXiv Detail & Related papers (2024-08-20T01:05:45Z) - GRFormer: Grouped Residual Self-Attention for Lightweight Single Image Super-Resolution [2.312414367096445]
Grouped Residual Self-Attention (GRSA) is specifically oriented towards two fundamental components.
ES-RPB is a substitute for the original relative position bias to improve the ability to represent position information.
Experiments demonstrate GRFormer outperforms state-of-the-art transformer-based methods for $\times$2, $\times$3 and $\times$4 SISR tasks.
arXiv Detail & Related papers (2024-08-14T11:56:35Z) - Unified Low-rank Compression Framework for Click-through Rate Prediction [15.813889566241539]
We propose a unified low-rank decomposition framework for compressing CTR prediction models.
Our framework can achieve better performance than the original model.
Our framework can be applied to embedding tables and layers in various CTR prediction models.
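As a hedged illustration of what low-rank decomposition of an embedding table can look like (a simple SVD-based factorization sketch; the sizes, rank, and variable names are assumptions, not the paper's unified framework):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, dim, rank = 50_000, 64, 8

    # Dense embedding table: vocab * dim parameters.
    E = rng.standard_normal((vocab, dim)).astype(np.float32)

    # Low-rank replacement: vocab * rank + rank * dim parameters.
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    P = (U[:, :rank] * S[:rank]).astype(np.float32)   # per-id factors (vocab x rank)
    Q = Vt[:rank, :].astype(np.float32)               # shared projection (rank x dim)

    ids = np.array([3, 42, 7])
    approx = P[ids] @ Q              # lookup-then-project replaces a direct table lookup
    print(E.size, P.size + Q.size)   # parameter count before vs. after

In practice such factors would be learned or fine-tuned rather than taken from an SVD of a random table; the point here is only the parameter-count arithmetic: 50,000 x 64 = 3.2M entries shrink to 50,000 x 8 + 8 x 64, roughly 0.4M.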
arXiv Detail & Related papers (2024-05-28T13:06:32Z) - RTP: Rethinking Tensor Parallelism with Memory Deduplication [3.036340414461332]
Rotated Parallelism (RTP) is an innovative approach that focuses on memory deduplication in distributed training environments.
Our empirical evaluations underscore RTP's efficiency, revealing that its memory consumption during distributed system training is remarkably close to the optimal.
arXiv Detail & Related papers (2023-11-02T23:12:42Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - Revisiting RCAN: Improved Training for Image Super-Resolution [94.8765153437517]
We revisit the popular RCAN model and examine the effect of different training options in SR.
We show that RCAN can outperform or match nearly all the CNN-based SR architectures published after RCAN on standard benchmarks.
arXiv Detail & Related papers (2022-01-27T02:20:11Z) - Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight-parameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
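For intuition, here is a minimal sketch of a Powerpropagation-style reparameterisation and the gradient scaling it induces (forward map w = theta * |theta|**(alpha - 1) with alpha > 1; a simplified NumPy illustration under these assumptions, not the authors' training code):

    import numpy as np

    def powerprop_weights(theta, alpha=2.0):
        """Effective weights w = theta * |theta|**(alpha - 1); the sign of theta is kept."""
        return theta * np.abs(theta) ** (alpha - 1)

    def powerprop_grad_theta(theta, grad_w, alpha=2.0):
        """Chain rule: dL/dtheta = dL/dw * alpha * |theta|**(alpha - 1).
        Small-magnitude parameters get proportionally smaller updates, which piles
        mass near zero and makes subsequent magnitude pruning less damaging."""
        return grad_w * alpha * np.abs(theta) ** (alpha - 1)

    theta = np.array([-1.5, -0.1, 0.0, 0.2, 2.0])
    print(powerprop_weights(theta))                   # [-2.25, -0.01, 0., 0.04, 4.]
    print(powerprop_grad_theta(theta, np.ones(5)))    # [3., 0.2, 0., 0.4, 4.]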
arXiv Detail & Related papers (2021-10-01T10:03:57Z) - A Generic Network Compression Framework for Sequential Recommender Systems [71.81962915192022]
Sequential recommender systems (SRS) have become the key technology in capturing user's dynamic interests and generating high-quality recommendations.
We propose a compressed sequential recommendation framework, termed as CpRec, where two generic model shrinking techniques are employed.
Through extensive ablation studies, we demonstrate that the proposed CpRec can achieve up to 4$\sim$8 times compression rates on real-world SRS datasets.
arXiv Detail & Related papers (2020-04-21T08:40:55Z)