PruMUX: Augmenting Data Multiplexing with Model Compression
- URL: http://arxiv.org/abs/2305.14706v2
- Date: Wed, 23 Aug 2023 21:22:01 GMT
- Title: PruMUX: Augmenting Data Multiplexing with Model Compression
- Authors: Yushan Su, Vishvak Murahari, Karthik Narasimhan, Kai Li
- Abstract summary: In this paper, we combine two such methods -- structured pruning and data multiplexing -- to compound the speedup gains obtained by either method.
Our approach, PruMUX, obtains 7.5-29.5X throughput improvements over the BERT-base model at accuracy thresholds ranging from 80% to 74%.
We propose Auto-PruMUX, a meta-level model that can predict the high-performance parameters for pruning and multiplexing given a desired accuracy loss budget.
- Score: 42.89593283051397
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As language models increase in size by the day, methods for efficient
inference are critical to leveraging their capabilities for various
applications. Prior work has investigated techniques like model pruning,
knowledge distillation, and data multiplexing to increase model throughput
without sacrificing accuracy. In this paper, we combine two such methods --
structured pruning and data multiplexing -- to compound the speedup gains
obtained by either method. Our approach, PruMUX, obtains 7.5-29.5X throughput
improvements over the BERT-base model at accuracy thresholds ranging from 80%
to 74%. We further study various combinations of parameters (such as sparsity and
multiplexing factor) in the two techniques to provide a comprehensive analysis
of the tradeoff between accuracy and throughput in the resulting models. We
then propose Auto-PruMUX, a meta-level model that can predict the
high-performance parameters for pruning and multiplexing given a desired
accuracy loss budget, providing a practical method to leverage the combination
effectively.
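To make the Auto-PruMUX idea concrete, here is a minimal sketch, assuming made-up (sparsity, multiplexing factor) measurements and a simple linear performance model; the paper's actual model family and measured numbers differ.

```python
# Hypothetical sketch of the Auto-PruMUX idea: fit simple performance models
# over (sparsity, multiplexing factor) and pick the fastest configuration
# whose predicted accuracy stays within a given loss budget. All numbers
# below are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Measured (sparsity, multiplexing factor N) -> (accuracy, relative throughput)
configs = np.array([
    [0.60, 2], [0.60, 5], [0.80, 2], [0.80, 5], [0.90, 2], [0.90, 5],
])
accuracy = np.array([0.83, 0.80, 0.82, 0.78, 0.79, 0.74])   # illustrative
throughput = np.array([4.0, 9.0, 6.0, 14.0, 8.0, 20.0])     # x over BERT-base

acc_model = LinearRegression().fit(configs, accuracy)
thr_model = LinearRegression().fit(configs, throughput)

def auto_prumux(base_accuracy, loss_budget, candidates):
    """Return the candidate (sparsity, N) with the best predicted throughput
    whose predicted accuracy loss stays within the budget."""
    feasible = [
        c for c in candidates
        if base_accuracy - acc_model.predict([c])[0] <= loss_budget
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda c: thr_model.predict([c])[0])

grid = [(s, n) for s in (0.6, 0.7, 0.8, 0.9) for n in (2, 5, 10)]
print(auto_prumux(base_accuracy=0.84, loss_budget=0.04, candidates=grid))
```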
Related papers
- Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference [55.150117654242706]
We show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU.
As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty.
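As a rough illustration of what GP model selection optimizes, here is a numpy sketch of exact log-marginal-likelihood-based hyperparameter selection on a toy dataset; the paper's contribution is a computation-aware method that scales this to millions of points, which this sketch does not attempt.

```python
# Minimal sketch (not the paper's computation-aware method): classic GP model
# selection by maximizing the exact log marginal likelihood on a tiny dataset.
import numpy as np

def rbf_kernel(X, lengthscale):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def log_marginal_likelihood(X, y, lengthscale, noise):
    K = rbf_kernel(X, lengthscale) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()
            - 0.5 * len(X) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

# Model selection = pick the hyperparameters with the highest evidence.
best = max(((ls, lml) for ls in (0.1, 0.5, 1.0, 2.0)
            for lml in [log_marginal_likelihood(X, y, ls, 0.01)]),
           key=lambda t: t[1])
print(f"best lengthscale={best[0]}, log marginal likelihood={best[1]:.2f}")
```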
arXiv Detail & Related papers (2024-11-01T21:11:48Z)
- Hybrid Deep Convolutional Neural Networks Combined with Autoencoders And Augmented Data To Predict The Look-Up Table 2006 [2.082445711353476]
This study explores the development of a hybrid deep convolutional neural network (DCNN) model enhanced by autoencoders and data augmentation techniques.
By augmenting the original input features using three different autoencoder configurations, the model's predictive capabilities were significantly improved.
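A hedged sketch of the general recipe (autoencoder codes concatenated to the original features), with an assumed toy architecture rather than the paper's three configurations:

```python
# Sketch: an autoencoder learns a compressed code of the inputs, and the code
# is concatenated to the original features for the downstream model.
import torch
import torch.nn as nn

class FeatureAugmenter(nn.Module):
    def __init__(self, n_features, n_latent):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_latent), nn.ReLU())
        self.decoder = nn.Linear(n_latent, n_features)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

x = torch.randn(128, 16)                      # toy batch with 16 input features
ae = FeatureAugmenter(n_features=16, n_latent=4)
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
for _ in range(200):                          # reconstruction training
    recon, _ = ae(x)
    loss = nn.functional.mse_loss(recon, x)
    opt.zero_grad(); loss.backward(); opt.step()

_, code = ae(x)
augmented = torch.cat([x, code.detach()], dim=1)   # 16 + 4 augmented features
print(augmented.shape)                              # torch.Size([128, 20])
```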
arXiv Detail & Related papers (2024-08-26T20:45:07Z)
- Effective Interplay between Sparsity and Quantization: From Theory to Practice [33.697590845745815]
Sparsity and quantization are two prominent compression methods that have individually demonstrated significant reduction in computational and memory footprints while preserving model accuracy.
We investigate the interaction between these two methods and assess whether their combination impacts final model accuracy.
Our findings extend to the efficient deployment of large models on resource-limited compute platforms and help reduce serving costs.
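As a toy illustration of why the interaction matters, the following numpy sketch applies magnitude pruning and uniform quantization in both orders and compares the resulting weight error; the sparsity level and bit width are arbitrary choices, not the paper's.

```python
# Toy sketch of the sparsity/quantization interplay: compress the same
# weights in both orders and compare reconstruction error.
import numpy as np

def prune(w, sparsity):
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= thresh, w, 0.0)

def quantize(w, bits=4):
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000)

pq = quantize(prune(w, sparsity=0.5))   # prune, then quantize
qp = prune(quantize(w), sparsity=0.5)   # quantize, then prune

print("prune->quantize MSE:", np.mean((w - pq) ** 2))
print("quantize->prune MSE:", np.mean((w - qp) ** 2))
```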
arXiv Detail & Related papers (2024-05-31T15:34:13Z)
- Fairer and More Accurate Tabular Models Through NAS [14.147928131445852]
We propose using multi-objective Neural Architecture Search (NAS) and Hyperparameter Optimization (HPO), in the first application of these techniques to the challenging domain of tabular data.
We show that models optimized solely for accuracy with NAS often fail to inherently address fairness concerns.
We produce architectures that consistently dominate state-of-the-art bias mitigation methods either in fairness, accuracy or both.
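The multi-objective selection step can be pictured with a small Pareto-front filter over (accuracy, fairness) scores; the candidate scores below are invented for illustration.

```python
# Keep only the candidates not dominated on both objectives
# (higher is better for both accuracy and fairness here).
def pareto_front(candidates):
    """candidates: list of (name, accuracy, fairness); returns non-dominated."""
    front = []
    for name, acc, fair in candidates:
        dominated = any(
            a >= acc and f >= fair and (a > acc or f > fair)
            for _, a, f in candidates
        )
        if not dominated:
            front.append((name, acc, fair))
    return front

archs = [("A", 0.91, 0.70), ("B", 0.89, 0.85), ("C", 0.90, 0.60),
         ("D", 0.88, 0.86), ("E", 0.91, 0.65)]
print(pareto_front(archs))   # A, B, D survive
```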
arXiv Detail & Related papers (2023-10-18T17:56:24Z)
- Pruning Large Language Models via Accuracy Predictor [0.0]
Large language models (LLMs) containing tens of billions of parameters (or even more) have demonstrated impressive capabilities in various NLP tasks.
We propose a novel pruning approach: first, a training set of architecture-accuracy pairs is constructed; then, a non-neural model is trained on it as an accuracy predictor.
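A minimal sketch of that recipe, assuming each architecture is encoded as a vector of per-block sparsities and using a gradient-boosted regressor as the non-neural predictor (the paper's exact features and model may differ):

```python
# Fit a non-neural accuracy predictor on (architecture encoding, accuracy)
# pairs; all data below is synthetic stand-in, not measurements.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# 200 hypothetical pruned architectures, described by sparsity of 8 blocks.
X = rng.uniform(0.0, 0.9, size=(200, 8))
# Stand-in "measured" accuracy: denser models score higher, plus noise.
y = 0.85 - 0.1 * X.mean(axis=1) + 0.01 * rng.standard_normal(200)

predictor = GradientBoostingRegressor().fit(X, y)

candidate = rng.uniform(0.0, 0.9, size=(1, 8))
print("predicted accuracy:", predictor.predict(candidate)[0])
```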
arXiv Detail & Related papers (2023-09-18T06:38:24Z)
- Quick-Tune: Quickly Learning Which Pretrained Model to Finetune and How [62.467716468917224]
We propose a methodology that jointly searches for the optimal pretrained model and the hyperparameters for finetuning it.
Our method transfers knowledge about the performance of many pretrained models on a series of datasets.
We empirically demonstrate that our resulting approach can quickly select an accurate pretrained model for a new dataset.
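Stripped of the meta-learning, the joint search space can be sketched as plain random search; the model names and scoring stub below are placeholders, and the paper's method transfers performance knowledge across datasets rather than searching blindly.

```python
# Jointly search over (pretrained model, learning rate); the evaluation is
# stubbed out and returns a fake score for illustration.
import random

MODELS = ["vit_small", "vit_base", "resnet50"]          # hypothetical zoo
LR_RANGE = (1e-5, 1e-2)

def finetune_and_score(model_name, lr):
    """Stub: in practice, finetune `model_name` with `lr` and return
    validation accuracy. Here we fake a deterministic score."""
    rng = random.Random(hash((model_name, round(lr, 6))))
    return rng.uniform(0.6, 0.9)

best = max(
    ((m, lr, finetune_and_score(m, lr))
     for m in MODELS
     for lr in (random.uniform(*LR_RANGE) for _ in range(5))),
    key=lambda t: t[2],
)
print(f"best: model={best[0]}, lr={best[1]:.2g}, score={best[2]:.3f}")
```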
arXiv Detail & Related papers (2023-06-06T16:15:26Z)
- Sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm [62.997667081978825]
We propose a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression.
Minimal prior assumptions on the parameters are required through the use of plug-in empirical Bayes estimates of the hyperparameters.
The proposed approach is implemented in the R package probe.
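This is not the paper's partitioned ECM algorithm (which is implemented in the R package probe), but a generic empirical-Bayes baseline in scikit-learn conveys the plug-in flavor of estimating hyperparameters from the data itself:

```python
# Generic empirical-Bayes sparse regression baseline (ARD), shown only to
# illustrate data-driven hyperparameter estimation; not the probe package.
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)
n, p = 100, 50                       # many predictors, sparse truth
beta = np.zeros(p); beta[:3] = [2.0, -1.5, 1.0]
X = rng.standard_normal((n, p))
y = X @ beta + 0.1 * rng.standard_normal(n)

model = ARDRegression().fit(X, y)    # hyperparameters estimated from data
print("largest |coef| indices:", np.argsort(-np.abs(model.coef_))[:3])
```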
arXiv Detail & Related papers (2022-09-16T19:15:50Z)
- Structured Pruning Learns Compact and Accurate Models [28.54826400747667]
We propose a task-specific structured pruning method, CoFi (Coarse- and Fine-grained Pruning).
CoFi delivers highly parallelizable subnetworks and matches distillation methods in both accuracy and latency.
Our experiments on GLUE and SQuAD datasets show that CoFi yields models with over 10x speedups with a small accuracy drop.
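A coarse-grained sketch in the same spirit (scoring and removing whole attention heads); CoFi itself learns masks at multiple granularities jointly with a distillation objective, which this toy example does not do.

```python
# Structured pruning toy: score attention heads by weight norm and zero out
# the lowest-scoring ones so whole units can later be removed.
import torch
import torch.nn as nn

n_heads, head_dim, hidden = 12, 64, 768
attn_out = nn.Linear(hidden, hidden)       # stand-in output projection

# Importance per head: norm of the weight slice belonging to that head.
W = attn_out.weight.detach().view(hidden, n_heads, head_dim)
scores = W.norm(dim=(0, 2))                # one score per head

keep = scores.argsort(descending=True)[:8] # keep the 8 most important heads
mask = torch.zeros(n_heads); mask[keep] = 1.0

with torch.no_grad():                      # apply the structured mask
    attn_out.weight.view(hidden, n_heads, head_dim).mul_(mask.view(1, -1, 1))
print(f"pruned {int((mask == 0).sum())} of {n_heads} heads")
```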
arXiv Detail & Related papers (2022-04-01T13:09:56Z)
- Fast, Accurate, and Simple Models for Tabular Data via Augmented Distillation [97.42894942391575]
We propose FAST-DAD to distill arbitrarily complex ensemble predictors into individual models like boosted trees, random forests, and deep networks.
Our individual distilled models are over 10x faster and more accurate than ensemble predictors produced by AutoML tools like H2O/AutoSklearn.
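A hedged sketch of the distillation loop, with a simple jitter-based augmentation standing in for FAST-DAD's Gibbs-sampled data:

```python
# Distill an ensemble teacher into a single student on augmented,
# teacher-labeled data; the data and augmentation here are toy stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
y = X[:, 0] * 2 + np.sin(X[:, 1]) + 0.1 * rng.standard_normal(500)

teacher = RandomForestRegressor(n_estimators=100).fit(X, y)  # the ensemble

# Augment: jitter real rows, then let the teacher label everything.
X_aug = np.vstack([X, X + 0.1 * rng.standard_normal(X.shape)])
y_soft = teacher.predict(X_aug)

student = DecisionTreeRegressor(max_depth=8).fit(X_aug, y_soft)
print("student/teacher agreement:",
      np.corrcoef(student.predict(X), teacher.predict(X))[0, 1])
```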
arXiv Detail & Related papers (2020-06-25T09:57:47Z)
- Efficient Ensemble Model Generation for Uncertainty Estimation with Bayesian Approximation in Segmentation [74.06904875527556]
We propose a generic and efficient segmentation framework to construct ensemble segmentation models.
In the proposed method, ensemble models can be efficiently generated by using the layer selection method.
We also devise a new pixel-wise uncertainty loss, which improves the predictive performance.
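The uncertainty side can be illustrated with a small numpy sketch: average the ensemble members' softmax maps and take the per-pixel predictive entropy (a generic uncertainty measure, not the paper's proposed loss).

```python
# Pixel-wise uncertainty from an ensemble of segmentation models.
import numpy as np

rng = np.random.default_rng(0)
members, classes, H, W = 5, 3, 4, 4
logits = rng.standard_normal((members, classes, H, W))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

mean_probs = probs.mean(axis=0)                       # ensemble prediction
entropy = -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=0)

print("prediction:", mean_probs.argmax(axis=0))       # (H, W) label map
print("uncertainty:", entropy.round(2))               # (H, W) entropy map
```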
arXiv Detail & Related papers (2020-05-21T16:08:38Z)