An exploration of the effect of quantisation on energy consumption and inference time of StarCoder2
- URL: http://arxiv.org/abs/2411.12758v1
- Date: Fri, 15 Nov 2024 21:28:19 GMT
- Title: An exploration of the effect of quantisation on energy consumption and inference time of StarCoder2
- Authors: Pepijn de Reus, Ana Oprescu, Jelle Zuidema
- Abstract summary: This study examines quantisation and pruning strategies to reduce energy consumption in code Large Language Models (LLMs) inference.
We observe increased energy demands with quantization due to lower throughput, as well as some accuracy losses.
We suggest future work on hardware-optimized quantization to enhance efficiency with minimal loss in accuracy.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study examines quantisation and pruning strategies to reduce energy consumption in code Large Language Models (LLMs) inference. Using StarCoder2, we observe increased energy demands with quantization due to lower throughput, as well as some accuracy losses. Conversely, pruning reduces energy usage but impairs performance. The results highlight challenges and trade-offs in LLM model compression. We suggest future work on hardware-optimized quantization to enhance efficiency with minimal loss in accuracy.
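The kind of measurement the abstract describes can be approximated with off-the-shelf tooling. The sketch below is illustrative only, not the authors' setup: it loads an assumed StarCoder2 checkpoint (`bigcode/starcoder2-3b`) in 8-bit precision via bitsandbytes and records inference time and an energy estimate with CodeCarbon.

```python
# Illustrative sketch only (not the paper's exact setup): load an assumed
# StarCoder2 checkpoint in 8-bit via bitsandbytes, then record wall-clock
# inference time and an energy/emissions estimate with CodeCarbon.
# Requires a CUDA GPU plus transformers, accelerate, bitsandbytes, codecarbon.
import time

import torch
from codecarbon import EmissionsTracker
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "bigcode/starcoder2-3b"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

tracker = EmissionsTracker()  # estimates energy (kWh) and CO2-eq for the block
tracker.start()
start = time.perf_counter()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start
tracker.stop()

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"throughput: {new_tokens / elapsed:.1f} tokens/s")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Swapping the BitsAndBytesConfig for a 4-bit configuration (load_in_4bit=True), or dropping it for a full-precision baseline, gives the kind of quantised-versus-full-precision comparison the abstract describes.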
Related papers
- Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency [6.306413686006502]
We conduct a comprehensive analysis of 28 quantized Large Language Models (LLMs) from the Ollama library.
We evaluate energy efficiency, inference performance, and output accuracy across multiple quantization levels and task types.
Our findings reveal the trade-offs between energy efficiency, inference speed, and accuracy in different quantization settings.
arXiv Detail & Related papers (2025-04-04T11:29:30Z) - LLMPi: Optimizing LLMs for High-Throughput on Raspberry Pi [0.48212500317840945]
Deploying Large Language Models (LLMs) on resource-constrained edge devices like the Raspberry Pi presents challenges in computational efficiency, power consumption, and response latency.
This paper explores quantization-based optimization techniques to enable high-throughput, energy-efficient execution of LLMs on low-power embedded systems.
Our findings highlight the potential of quantized LLMs for real-time conversational AI on edge devices, paving the way for low-power, high-efficiency AI deployment in mobile and embedded applications.
arXiv Detail & Related papers (2025-04-02T20:29:39Z) - QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge [55.75103034526652]
We propose QuartDepth, which adopts post-training quantization to quantize monocular depth estimation (MDE) models with hardware acceleration for ASICs.
Our approach involves quantizing both weights and activations to 4-bit precision, reducing the model size and computation cost.
We design a flexible and programmable hardware accelerator by supporting kernel fusion and customized instruction programmability.
arXiv Detail & Related papers (2025-03-20T21:03:10Z) - HALO: Hardware-aware quantization with low critical-path-delay weights for LLM acceleration [5.88033624474104]
HALO is a versatile framework for Hardware-Aware Post-Training Quantization (PTQ).
Unlike traditional methods, HALO explicitly incorporates detailed hardware characteristics, including critical-path timing and power consumption.
HALO achieves average performance improvements of 270% and energy savings of 51% over baseline quantization methods.
arXiv Detail & Related papers (2025-02-27T01:08:33Z) - Discovering highly efficient low-weight quantum error-correcting codes with reinforcement learning [0.5524804393257919]
We introduce a versatile and computationally efficient approach to stabilizer code weight reduction based on reinforcement learning (RL).
Our approach demonstrates savings in physical qubit overhead of 1 to 2 orders of magnitude over existing results for weight-6 codes.
arXiv Detail & Related papers (2025-02-20T09:05:34Z) - The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking [72.45765726160151]
This work identifies the Energy Loss Phenomenon in Reinforcement Learning from Human Feedback and its connection to reward hacking.
We propose an Energy loss-aware PPO algorithm (EPPO) which penalizes the increase in energy loss in the final layer during reward calculation to prevent excessive energy loss.
arXiv Detail & Related papers (2025-01-31T18:10:53Z) - Prompt engineering and its implications on the energy consumption of Large Language Models [4.791072577881446]
Large language models (LLMs) in software engineering pose severe challenges regarding computational resources, data centers, and carbon emissions.
In this paper, we investigate how prompt engineering techniques (PETs) can impact the carbon emissions of the Llama 3 model for the code generation task.
arXiv Detail & Related papers (2025-01-10T11:49:31Z) - MixPE: Quantization and Hardware Co-design for Efficient LLM Inference [16.42907854119748]
MixPE is a specialized mixed-precision processing element designed for efficient low-bit quantization in large language models.
We show that MixPE surpasses state-of-the-art quantization accelerators with a $2.6\times$ speedup and a $1.4\times$ energy reduction.
arXiv Detail & Related papers (2024-11-25T07:34:53Z) - Q-VLM: Post-training Quantization for Large Vision-Language Models [73.19871905102545]
We propose a post-training quantization framework of large vision-language models (LVLMs) for efficient multi-modal inference.
We mine the cross-layer dependency that significantly influences the discretization errors of the entire vision-language model, and embed this dependency into the optimal quantization strategy.
Experimental results demonstrate that our method compresses memory by 2.78x and increases generation speed by 1.44x for the 13B LLaVA model without performance degradation.
arXiv Detail & Related papers (2024-10-10T17:02:48Z) - Impact of ML Optimization Tactics on Greener Pre-Trained ML Models [46.78148962732881]
This study aims to (i) analyze image classification datasets and pre-trained models, (ii) improve inference efficiency by comparing optimized and non-optimized models, and (iii) assess the economic impact of the optimizations.
We conduct a controlled experiment to evaluate the impact of various PyTorch optimization techniques (dynamic quantization, torch.compile, local pruning, and global pruning) applied to 42 Hugging Face models for image classification.
Dynamic quantization demonstrates significant reductions in inference time and energy consumption, making it highly suitable for large-scale systems; a minimal sketch of this tactic appears after this list.
arXiv Detail & Related papers (2024-09-19T16:23:03Z) - DiTAS: Quantizing Diffusion Transformers via Enhanced Activation Smoothing [5.174900115018253]
We propose a data-free post-training quantization (PTQ) method for efficient Diffusion Transformers (DiTs).
DiTAS relies on the proposed temporal-aggregated smoothing techniques to mitigate the impact of the channel-wise outliers within the input activations.
We show that our approach enables 4-bit weight, 8-bit activation (W4A8) quantization for DiTs while maintaining comparable performance as the full-precision model.
arXiv Detail & Related papers (2024-09-12T05:18:57Z) - PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks [4.827161693957252]
Non-quantized elementwise operations dominate the inference cost of low-precision models.
The PikeLPN model addresses these issues by applying quantization to both elementwise operations and multiply-accumulate operations.
arXiv Detail & Related papers (2024-03-29T18:23:34Z) - What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation [55.153595212571375]
Quantization is a technique for improving the memory and computational efficiency of large language models (LLMs).
We propose a new perspective on quantization, viewing it as perturbations added to the weights and activations of LLMs.
We conduct experiments with various artificial perturbations to explore their impact on LLM performance; a toy version of such a perturbation probe is sketched after this list.
arXiv Detail & Related papers (2024-03-11T03:42:51Z) - WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z) - Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study [90.34226812493083]
This work aims to investigate the impact of quantization on emergent abilities, which are important characteristics that distinguish LLMs from small language models.
Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation.
To improve the performance of low-bit models, we conduct two special experiments: (1) fine-grained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning.
arXiv Detail & Related papers (2023-07-16T15:11:01Z) - NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers [53.85087932591237]
NoisyQuant is a quantizer-agnostic enhancement for the post-training activation quantization performance of vision transformers.
Building on the theoretical insight, NoisyQuant achieves the first success in actively altering the heavy-tailed activation distribution.
NoisyQuant largely improves the post-training quantization performance of vision transformers with minimal computation overhead.
arXiv Detail & Related papers (2022-11-29T10:02:09Z)
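As referenced in the entry on ML optimization tactics above, the dynamic-quantization tactic can be sketched in a few lines of PyTorch. This is a minimal, hedged example: the classifier checkpoint is an assumption chosen for illustration, not necessarily one of the 42 models in that study.

```python
# Hedged sketch of the dynamic-quantization tactic: the classifier name is an
# arbitrary Hugging Face checkpoint chosen for illustration. Runs on CPU.
import torch
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")
model.eval()

# quantize_dynamic rewrites nn.Linear modules with int8 weights; activations
# are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

dummy = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image batch
with torch.no_grad():
    logits = quantized(pixel_values=dummy).logits
print(logits.shape)
```

Because only the linear weights are converted ahead of time and activations are quantized per batch, no calibration dataset is required, which is part of why this tactic transfers easily across models.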
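Similarly, the perturbation lens from "What Makes Quantization for Large Language Models Hard?" can be probed with a toy experiment, sketched below under stated assumptions: inject small Gaussian noise into a model's weights and compare perplexity before and after, with gpt2 standing in for the much larger LLMs the paper studies.

```python
# Toy illustration of the perturbation view of quantization: add small
# Gaussian noise to every weight tensor and compare perplexity before and
# after. The checkpoint (gpt2) and the 1% noise scale are illustrative
# assumptions, not the paper's configuration.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # small stand-in; the paper studies larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = "Quantization adds a structured perturbation to model weights."
inputs = tokenizer(text, return_tensors="pt")

def perplexity(m):
    with torch.no_grad():
        loss = m(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())

print("baseline perplexity: ", perplexity(model))

# Zero-mean Gaussian noise scaled to each tensor's standard deviation,
# loosely mimicking the artificial perturbations in the empirical study.
with torch.no_grad():
    for p in model.parameters():
        p.add_(torch.randn_like(p) * 0.01 * p.std())

print("perturbed perplexity:", perplexity(model))
```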