An exploration of the effect of quantisation on energy consumption and inference time of StarCoder2
- URL: http://arxiv.org/abs/2411.12758v1
- Date: Fri, 15 Nov 2024 21:28:19 GMT
- Title: An exploration of the effect of quantisation on energy consumption and inference time of StarCoder2
- Authors: Pepijn de Reus, Ana Oprescu, Jelle Zuidema
- Abstract summary: This study examines quantisation and pruning strategies to reduce energy consumption in code Large Language Model (LLM) inference.
We observe increased energy demands with quantisation due to lower throughput, as well as some accuracy losses.
We suggest future work on hardware-optimised quantisation to enhance efficiency with minimal loss in accuracy.
- Score: 0.0
- Abstract: This study examines quantisation and pruning strategies to reduce energy consumption in code Large Language Model (LLM) inference. Using StarCoder2, we observe increased energy demands with quantisation due to lower throughput, as well as some accuracy losses. Conversely, pruning reduces energy usage but impairs performance. The results highlight challenges and trade-offs in LLM model compression. We suggest future work on hardware-optimised quantisation to enhance efficiency with minimal loss in accuracy.
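As a rough illustration of the kind of measurement this study performs, the sketch below runs StarCoder2 inference at full precision and in 8-bit and compares throughput and emissions. The checkpoint (bigcode/starcoder2-3b), the prompt, and the use of bitsandbytes and CodeCarbon are illustrative assumptions, not the authors' documented setup.
```python
# Minimal sketch: compare throughput and emissions of full-precision vs 8-bit
# StarCoder2 inference. Model id, prompt, bitsandbytes and CodeCarbon are
# illustrative assumptions, not the paper's exact experimental setup.
# Requires a CUDA GPU plus the accelerate, bitsandbytes and codecarbon packages.
import time
import torch
from codecarbon import EmissionsTracker
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "bigcode/starcoder2-3b"  # assumed checkpoint size
PROMPT = "def fibonacci(n):"

def measure(quantized: bool) -> None:
    """Generate 128 tokens and report tokens/s and CO2-eq as an energy proxy."""
    quant_cfg = BitsAndBytesConfig(load_in_8bit=True) if quantized else None
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, quantization_config=quant_cfg, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    inputs = tokenizer(PROMPT, return_tensors="pt").to(model.device)

    tracker = EmissionsTracker()
    tracker.start()
    start = time.perf_counter()
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=128)
    elapsed = time.perf_counter() - start
    emissions_kg = tracker.stop()  # kg CO2-eq for this single generation

    new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"quantized={quantized} tokens/s={new_tokens / elapsed:.1f} "
          f"emissions={emissions_kg:.6f} kg CO2-eq")

measure(quantized=False)
measure(quantized=True)
```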
Related papers
- Discovering highly efficient low-weight quantum error-correcting codes with reinforcement learning [0.5524804393257919]
We introduce a versatile and computationally efficient approach to stabilizer code weight reduction based on reinforcement learning (RL).
Our approach demonstrates savings in physical qubit overhead of 1 to 2 orders of magnitude over existing results for weight-6 codes.
arXiv Detail & Related papers (2025-02-20T09:05:34Z)
- The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking [72.45765726160151]
This work identifies the Energy Loss Phenomenon in Reinforcement Learning from Human Feedback and its connection to reward hacking.
We propose an Energy loss-aware PPO algorithm (EPPO), which penalizes the increase in energy loss in the final layer during reward calculation to prevent excessive energy loss.
arXiv Detail & Related papers (2025-01-31T18:10:53Z)
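As a purely illustrative reading of the penalty described in the entry above, the sketch below subtracts an energy-loss term from the reward model's score. The energy definition (L1 norm of the final-layer hidden states, measured against a reference rollout) and the coefficient are assumptions; the paper defines its own energy-loss quantity.
```python
# Illustrative-only sketch of an energy-loss-aware reward penalty in the spirit
# of EPPO. The energy definition (L1 norm of final-layer hidden states) and the
# penalty coefficient are assumptions, not the paper's formulation.
import torch

def final_layer_energy(hidden_states: torch.Tensor) -> torch.Tensor:
    """Scalar 'energy' of the final-layer hidden states (assumed L1 definition)."""
    return hidden_states.abs().sum(dim=(-2, -1))  # one value per sequence

def eppo_reward(reward: torch.Tensor,
                final_hidden: torch.Tensor,
                reference_energy: torch.Tensor,
                alpha: float = 0.01) -> torch.Tensor:
    """Penalize the increase in energy loss relative to a reference rollout."""
    energy_loss = reference_energy - final_layer_energy(final_hidden)
    return reward - alpha * torch.clamp(energy_loss, min=0.0)

# Toy usage with random tensors standing in for model outputs.
reward = torch.tensor([1.0, 0.5])
final_hidden = torch.randn(2, 16, 64)          # (batch, seq, hidden)
reference_energy = torch.full((2,), 800.0)     # energy of a reference policy
print(eppo_reward(reward, final_hidden, reference_energy))
```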
- Prompt engineering and its implications on the energy consumption of Large Language Models [4.791072577881446]
Large language models (LLMs) in software engineering pose severe challenges regarding computational resources, data centers, and carbon emissions.
In this paper, we investigate how prompt engineering techniques (PETs) can impact the carbon emissions of the Llama 3 model on the code generation task.
arXiv Detail & Related papers (2025-01-10T11:49:31Z)
- Q-VLM: Post-training Quantization for Large Vision-Language Models [73.19871905102545]
We propose a post-training quantization framework of large vision-language models (LVLMs) for efficient multi-modal inference.
We mine the cross-layer dependency that significantly influences discretization errors of the entire vision-language model, and embed this dependency into the optimal quantization strategy.
Experimental results demonstrate that our method compresses memory by 2.78x and increases generation speed by 1.44x on the 13B LLaVA model without performance degradation.
arXiv Detail & Related papers (2024-10-10T17:02:48Z)
- Impact of ML Optimization Tactics on Greener Pre-Trained ML Models [46.78148962732881]
This study aims to (i) analyze image classification datasets and pre-trained models, (ii) improve inference efficiency by comparing optimized and non-optimized models, and (iii) assess the economic impact of the optimizations.
We conduct a controlled experiment to evaluate the impact of various PyTorch optimization techniques (dynamic quantization, torch.compile, local pruning, and global pruning) on 42 Hugging Face models for image classification.
Dynamic quantization demonstrates significant reductions in inference time and energy consumption, making it highly suitable for large-scale systems.
arXiv Detail & Related papers (2024-09-19T16:23:03Z)
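The four tactics named in the entry above correspond to standard PyTorch APIs. The sketch below applies them to a small torchvision classifier as a stand-in for the paper's 42 Hugging Face models; the model choice and the 20% pruning amount are assumptions.
```python
# Sketch of the four optimization tactics named above, applied to a small
# torchvision model as a stand-in for the paper's 42 Hugging Face classifiers.
# The model and the 20% pruning amount are illustrative assumptions.
import torch
import torch.nn.utils.prune as prune
from torchvision.models import resnet18

model = resnet18(weights=None).eval()

# 1) Dynamic quantization: Linear weights stored as int8, dequantized on the fly.
dq_model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# 2) torch.compile: JIT-compile the forward pass for faster inference.
compiled_model = torch.compile(model)

# 3) Local pruning: zero out 20% of the weights in one specific layer.
prune.l1_unstructured(model.fc, name="weight", amount=0.2)

# 4) Global pruning: zero out 20% of weights across several layers at once,
#    ranked globally by magnitude.
conv_layers = [(m, "weight") for m in model.modules()
               if isinstance(m, torch.nn.Conv2d)]
prune.global_unstructured(
    conv_layers, pruning_method=prune.L1Unstructured, amount=0.2
)

# Quick smoke test on a dummy image batch.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    print(dq_model(x).shape, compiled_model(x).shape)
```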
- DiTAS: Quantizing Diffusion Transformers via Enhanced Activation Smoothing [5.174900115018253]
We propose a data-free post-training quantization (PTQ) method for efficient Diffusion Transformers (DiTs).
DiTAS relies on the proposed temporal-aggregated smoothing techniques to mitigate the impact of the channel-wise outliers within the input activations.
We show that our approach enables 4-bit weight, 8-bit activation (W4A8) quantization for DiTs while maintaining comparable performance as the full-precision model.
arXiv Detail & Related papers (2024-09-12T05:18:57Z)
- What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation [55.153595212571375]
Quantization is a technique for improving the memory and computational efficiency of large language models (LLMs).
We propose a new perspective on quantization, viewing it as perturbations added to the weights and activations of LLMs.
We conduct experiments with various artificial perturbations to explore their impact on LLM performance.
arXiv Detail & Related papers (2024-03-11T03:42:51Z)
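The perturbation view described in the entry above can be approximated in a few lines: inject Gaussian noise of increasing magnitude into a model's weights and track how the language-modeling loss degrades. The model (GPT-2), the noise scales, and the single-sentence loss are illustrative assumptions, not the paper's protocol.
```python
# Sketch of the "quantization as weight perturbation" experiment style: add
# Gaussian noise of increasing scale to all weights and watch the loss degrade.
# Model choice (GPT-2), noise scales, and the single-sentence loss are assumptions.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("The quick brown fox jumps over the lazy dog.",
                   return_tensors="pt")

def loss_with_perturbation(scale: float) -> float:
    """Language-modeling loss after adding N(0, scale^2) noise to every weight."""
    noisy = copy.deepcopy(model)
    with torch.no_grad():
        for param in noisy.parameters():
            param.add_(torch.randn_like(param) * scale)
        out = noisy(**inputs, labels=inputs["input_ids"])
    return out.loss.item()

for scale in [0.0, 0.001, 0.01, 0.05]:  # assumed perturbation magnitudes
    print(f"noise scale {scale}: loss {loss_with_perturbation(scale):.3f}")
```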
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
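Key/value-cache quantization of the kind WKVQuant targets can be pictured with a plain int8 round trip: store the cached tensors as int8 plus a scale and zero point, and dequantize when attention needs them. The asymmetric per-tensor scheme below is a generic baseline, not the paper's algorithm.
```python
# Generic sketch of int8 key/value-cache quantization: store the KV tensors as
# int8 plus scale/zero-point, dequantize when attention needs them. This is a
# plain asymmetric per-tensor scheme, not WKVQuant's actual method.
import torch

def quantize_kv(x: torch.Tensor):
    """Quantize a float KV tensor to int8 with an affine (scale, zero_point)."""
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / 255.0
    zero_point = torch.round(-x_min / scale) - 128
    q = torch.clamp(torch.round(x / scale + zero_point), -128, 127).to(torch.int8)
    return q, scale, zero_point

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor,
                  zero_point: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float tensor from the int8 cache entry."""
    return (q.to(torch.float32) - zero_point) * scale

# Toy usage: a cached key tensor of shape (batch, heads, seq, head_dim).
keys = torch.randn(1, 8, 32, 64)
q_keys, scale, zp = quantize_kv(keys)
recovered = dequantize_kv(q_keys, scale, zp)
print("max abs error:", (keys - recovered).abs().max().item())
```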
- Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study [90.34226812493083]
This work aims to investigate the impact of quantization on emergent abilities, which are important characteristics that distinguish LLMs from small language models.
Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation.
To improve the performance of low-bit models, we conduct two special experiments: (1) a fine-grained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning.
arXiv Detail & Related papers (2023-07-16T15:11:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.