Top-KAST: Top-K Always Sparse Training
- URL: http://arxiv.org/abs/2106.03517v1
- Date: Mon, 7 Jun 2021 11:13:05 GMT
- Title: Top-KAST: Top-K Always Sparse Training
- Authors: Siddhant M. Jayakumar, Razvan Pascanu, Jack W. Rae, Simon Osindero,
Erich Elsen
- Abstract summary: We propose Top-KAST, a method that preserves constant sparsity throughout training.
We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark.
In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
- Score: 50.05611544535801
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparse neural networks are becoming increasingly important as the field seeks
to improve the performance of existing models by scaling them up, while
simultaneously trying to reduce power consumption and computational footprint.
Unfortunately, most existing methods for inducing performant sparse models
still entail the instantiation of dense parameters, or dense gradients in the
backward-pass, during training. For very large models this requirement can be
prohibitive. In this work we propose Top-KAST, a method that preserves constant
sparsity throughout training (in both the forward and backward-passes). We
demonstrate the efficacy of our approach by showing that it performs comparably
to or better than previous works when training models on the established
ImageNet benchmark, whilst fully maintaining sparsity. In addition to our
ImageNet results, we also demonstrate our approach in the domain of language
modeling where the current best performing architectures tend to have tens of
billions of parameters and scaling up does not yet seem to have saturated
performance. Sparse versions of these architectures can be run with
significantly fewer resources, making them more widely accessible and
applicable. Furthermore, in addition to being effective, our approach is
straightforward and can easily be implemented in a wide range of existing
machine learning frameworks with only a few additional lines of code. We
therefore hope that our contribution will help enable the broader community to
explore the potential held by massive models, without incurring massive
computational cost.
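The "few additional lines of code" claim can be illustrated with a minimal NumPy sketch of the core top-K masking idea: the forward pass uses only the largest-magnitude weights, while the backward pass uses a somewhat larger set so inactive weights can still receive gradient and re-enter the active set. The densities and names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def topk_mask(weights, density):
    """Binary mask keeping the largest-magnitude fraction `density` of weights."""
    k = max(1, int(density * weights.size))
    flat = np.abs(weights).ravel()
    threshold = np.partition(flat, -k)[-k]  # k-th largest magnitude
    return (np.abs(weights) >= threshold).astype(weights.dtype)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))

# Forward pass keeps a small active set; the backward pass keeps a
# superset of it, so both passes stay sparse (never dense).
forward_mask = topk_mask(W, density=0.25)
backward_mask = topk_mask(W, density=0.5)

W_sparse = W * forward_mask  # what the forward pass actually uses
```

Because thresholding is monotone in magnitude, the forward set is always contained in the backward set, which is the property Top-KAST relies on to let gradients flow to currently inactive weights.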
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- On Accelerating Edge AI: Optimizing Resource-Constrained Environments [1.7355861031903428]
Resource-constrained edge deployments demand AI solutions that balance high performance with stringent compute, memory, and energy limitations.
We present a comprehensive overview of the primary strategies for accelerating deep learning models under such constraints.
arXiv Detail & Related papers (2025-01-25T01:37:03Z)
- Numerical Pruning for Efficient Autoregressive Models [87.56342118369123]
This paper focuses on compressing decoder-only transformer-based autoregressive models through structural weight pruning.
Specifically, we propose a training-free pruning method that calculates a numerical score with Newton's method for the Attention and MLP modules, respectively.
To verify the effectiveness of our method, we provide both theoretical support and extensive experiments.
arXiv Detail & Related papers (2024-12-17T01:09:23Z)
- Efficient Ternary Weight Embedding Model: Bridging Scalability and Performance [15.877771709013743]
In this work, we propose a novel finetuning framework for ternary-weight embedding models.
To apply ternarization to pre-trained embedding models, we introduce self-taught knowledge distillation to finalize the ternary-weights of the linear layers.
With extensive experiments on public text and vision datasets, we demonstrate that the ternarized model achieves substantially lower memory usage without sacrificing effectiveness.
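As a rough illustration of ternary-weight quantization, here is a minimal NumPy sketch following the classic ternary-weight-network heuristic (threshold at 0.7 x mean |W|, per-tensor scale). This is an assumption for illustration; the paper's self-taught knowledge-distillation procedure for choosing the final ternary weights is not reproduced, and all names are illustrative.

```python
import numpy as np

def ternarize(W, delta_scale=0.7):
    """Quantize weights to {-alpha, 0, +alpha} (classic ternary-weight heuristic).

    delta_scale = 0.7 is the common heuristic threshold factor, assumed here;
    the paper's distillation-based finetuning is not part of this sketch.
    """
    delta = delta_scale * np.mean(np.abs(W))               # pruning threshold
    mask = np.abs(W) > delta                               # nonzero positions
    alpha = np.abs(W[mask]).mean() if mask.any() else 0.0  # per-tensor scale
    return alpha * np.sign(W) * mask

W = np.array([[0.9, -0.05, 0.4],
              [-0.8, 0.02, -0.6]])
W_t = ternarize(W)
# Every entry of W_t is one of {-alpha, 0, +alpha}, so the tensor can be
# stored as 2-bit codes plus one float scale.
```

Storing the sign pattern plus a single scale is what yields the low memory footprint the entry describes.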
arXiv Detail & Related papers (2024-11-23T03:44:56Z)
- Efficiently Robustify Pre-trained Models [18.392732966487582]
The robustness of large-scale models in real-world settings remains a less-explored topic.
We first benchmark the performance of these models under different perturbations and datasets.
We then discuss how existing robustification schemes based on full model fine-tuning may not be a scalable option for very large networks.
arXiv Detail & Related papers (2023-09-14T08:07:49Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose directing effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight reparameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
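The reparameterisation itself is compact: weights are expressed as w = phi * |phi|^(alpha-1), so the chain rule multiplies every gradient by alpha * |phi|^(alpha-1), shrinking updates to already-small parameters and concentrating mass at zero. A minimal NumPy sketch under that formula (the alpha value and all names are illustrative assumptions, not the authors' code):

```python
import numpy as np

ALPHA = 2.0  # assumed; any alpha > 1 gives the sparsity-inducing effect

def powerprop_weights(phi, alpha=ALPHA):
    """Effective weights w = phi * |phi|**(alpha - 1) = sign(phi) * |phi|**alpha."""
    return phi * np.abs(phi) ** (alpha - 1)

def powerprop_grad(phi, grad_w, alpha=ALPHA):
    """Chain rule: dL/dphi = dL/dw * dw/dphi = dL/dw * alpha * |phi|**(alpha - 1).

    The factor |phi|**(alpha - 1) shrinks with |phi|, so small underlying
    parameters receive small updates and are drawn toward zero, producing the
    high density at zero that makes pruning safe.
    """
    return grad_w * alpha * np.abs(phi) ** (alpha - 1)
```

In a framework with autodiff, only `powerprop_weights` needs to be inserted into the layer; the gradient scaling falls out automatically.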
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- Dynamic Model Pruning with Feedback [64.019079257231]
We propose a novel model compression method that generates a sparse trained model without additional overhead.
We evaluate our method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models.
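A minimal sketch of the dynamic-pruning-with-feedback idea: the forward and backward passes use the pruned weights, but the update is applied to a dense copy, so weights pruned at one step can recover later. This is an illustrative NumPy toy, not the paper's implementation; the density, learning rate, and names are assumptions.

```python
import numpy as np

def topk_mask(W, density):
    """Binary mask keeping the largest-magnitude fraction `density` of W."""
    k = max(1, int(density * W.size))
    thr = np.partition(np.abs(W).ravel(), -k)[-k]  # k-th largest magnitude
    return (np.abs(W) >= thr).astype(W.dtype)

def dpf_step(W_dense, grad_fn, density, lr):
    """One step in the spirit of dynamic pruning with feedback."""
    mask = topk_mask(W_dense, density)   # mask is re-derived every step
    W_sparse = W_dense * mask            # the model actually evaluated
    grad = grad_fn(W_sparse)             # gradient at the sparse point
    # Feedback: update the dense copy, not the sparse one, so pruned
    # entries keep their values and may re-enter the mask later.
    return W_dense - lr * grad, W_sparse
```

Because the dense copy is never overwritten by the mask, no extra state beyond the ordinary parameters is needed, matching the "without additional overhead" claim.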
arXiv Detail & Related papers (2020-06-12T15:07:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.