Layer-wise Weight Selection for Power-Efficient Neural Network Acceleration
- URL: http://arxiv.org/abs/2511.17123v2
- Date: Mon, 24 Nov 2025 15:02:34 GMT
- Title: Layer-wise Weight Selection for Power-Efficient Neural Network Acceleration
- Authors: Jiaxun Fang, Grace Li Zhang, Shaoyi Huang
- Abstract summary: Systolic array accelerators execute CNNs with energy dominated by multiply-accumulate (MAC) units. We propose an energy-aware, layer-wise compression framework that explicitly leverages MAC- and layer-level energy characteristics. Experiments on different CNN models demonstrate up to 58.6% energy reduction with a 2-3% accuracy drop, outperforming a state-of-the-art power-aware baseline.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Systolic array accelerators execute CNNs with energy dominated by the switching activity of multiply-accumulate (MAC) units. Although prior work exploits weight-dependent MAC power for compression, existing methods often use global activation models, coarse energy proxies, or layer-agnostic policies, which limits their effectiveness on real hardware. We propose an energy-aware, layer-wise compression framework that explicitly leverages MAC- and layer-level energy characteristics. First, we build a layer-aware MAC energy model that combines per-layer activation statistics with an MSB-Hamming distance grouping of 22-bit partial-sum transitions, and integrate it with a tile-level systolic mapping to estimate convolution-layer energy. On top of this model, we introduce an energy-accuracy co-optimized weight selection algorithm within quantization-aware training and an energy-prioritized layer-wise schedule that compresses high-energy layers more aggressively under a global accuracy constraint. Experiments on different CNN models demonstrate up to 58.6% energy reduction with a 2-3% accuracy drop, outperforming a state-of-the-art power-aware baseline.
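The partial-sum switching idea behind the energy model can be illustrated with a toy proxy. The sketch below counts bit toggles (Hamming distance) between consecutive partial sums held in a 22-bit accumulator; the function name, the uniform per-bit toggle cost, and the omission of the MSB grouping and tile-level systolic mapping are all simplifying assumptions, not the paper's actual model.

```python
def switching_energy_proxy(weights, activations, width=22):
    """Toy switching-energy proxy: count bit toggles between
    consecutive partial sums in a `width`-bit accumulator register.
    Uniform toggle cost is an assumption; the paper groups
    transitions by MSB-Hamming distance and weights them per layer."""
    mask = (1 << width) - 1
    psum, toggles = 0, 0
    for w, a in zip(weights, activations):
        new_psum = (psum + int(w) * int(a)) & mask
        toggles += bin(psum ^ new_psum).count("1")
        psum = new_psum
    return toggles
```

Under such a proxy, an energy-accuracy co-optimized selection would, for each layer, prefer candidate quantized weights that lower the toggle count while keeping the accuracy drop within the global constraint.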
Related papers
- Energy Scaling Laws for Diffusion Models: Quantifying Compute and Carbon Emissions in Image Generation [50.21021246855702]
We propose an adaptation of Kaplan scaling laws to predict GPU energy consumption for diffusion models based on computational complexity (FLOPs). Our approach decomposes diffusion model inference into text encoding, iterative denoising, and decoding components, with the hypothesis that denoising operations dominate energy consumption due to their repeated execution across multiple inference steps. Our results validate the compute-bound nature of diffusion inference and provide a foundation for sustainable AI deployment planning and carbon footprint estimation.
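A linear FLOPs-to-energy model of the kind described can be sketched as follows; the function name, the three-component decomposition signature, and the `joules_per_flop` constant are illustrative assumptions, not the paper's fitted coefficients.

```python
def diffusion_energy_joules(flops_encode, flops_denoise, flops_decode,
                            num_steps, joules_per_flop=1e-10):
    """Linear energy model: total energy scales with total FLOPs.
    The denoising term repeats once per inference step, which is
    why it dominates for typical step counts."""
    total_flops = flops_encode + num_steps * flops_denoise + flops_decode
    return total_flops * joules_per_flop
```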
arXiv Detail & Related papers (2025-11-21T08:12:47Z)
- Energy-Efficient and Dequantization-Free Q-LLMs: A Spiking Neural Network Approach to Salient Value Mitigation [18.963480523099694]
Spiking Neural Networks (SNNs) support mixed-precision storage and energy-efficient computation by replacing complex MACs with temporal accumulates (ACCs). We propose SpikeQuant, which selectively applies mixed-precision quantization to activations with salient values and re-encodes them into binary spike counts. Experimental results demonstrate that SpikeQuant consistently achieves near-FP16 perplexity under W4A4 quantization while reducing energy cost by up to 4.6 times compared to existing methods.
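The spike-count re-encoding can be sketched as below; the `timesteps` and `scale` parameters are illustrative, and SpikeQuant's salient-value selection and full W4A4 pipeline are not reproduced.

```python
def spike_count_encode(x, timesteps=8, scale=0.125):
    """Encode an activation as a binary spike count over `timesteps`
    steps: each emitted spike contributes `scale` to the reconstructed
    value, so downstream MACs reduce to accumulates (ACCs).
    Returns (spike_count, reconstructed_value)."""
    spikes = max(0, min(timesteps, round(x / scale)))
    return spikes, spikes * scale
```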
arXiv Detail & Related papers (2025-10-22T11:50:00Z)
- Compression of Site-Specific Deep Neural Networks for Massive MIMO Precoding [4.8310710966636545]
In this paper, we investigate the compute energy efficiency of mMIMO precoders using deep learning approaches. We propose a framework that incorporates mixed-precision quantization-aware training and neural architecture search to reduce energy usage. Our results show that deep neural network compression generates precoders with up to 35 times higher energy efficiency than WMMSE at equal performance.
arXiv Detail & Related papers (2025-02-12T20:03:32Z)
- Enhancing User Experience in On-Device Machine Learning with Gated Compression Layers [0.0]
On-device machine learning (ODML) enables powerful edge applications, but power consumption remains a key challenge for resource-constrained devices.
This work focuses on the use of Gated Compression (GC) layers to enhance ODML model performance while conserving power.
GC layers dynamically regulate data flow by selectively gating activations of neurons within the neural network and effectively filtering out non-essential inputs.
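A minimal sketch of the gating idea, assuming a fixed magnitude threshold in place of the learned per-neuron gates of an actual GC layer:

```python
import numpy as np

def gated_compress(activations, threshold=0.1):
    """Zero out (gate off) activations whose magnitude falls below
    `threshold`, filtering non-essential inputs so downstream compute
    and data movement can be skipped. Returns the gated activations
    and the fraction of neurons that remain active."""
    gate = np.abs(activations) >= threshold
    return activations * gate, float(gate.mean())
```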
arXiv Detail & Related papers (2024-05-02T21:18:06Z)
- Ultra-low Precision Multiplication-free Training for Deep Neural Networks [20.647925576138807]
In training, linear layers consume the most energy because of their intensive use of energy-consuming full-precision multiplications.
We propose an Adaptive Layer-wise Scaling PoT Quantization (ALS-POTQ) method and a Multiplication-Free MAC (MF-MAC) to replace all of the FP32 multiplications.
In our training scheme, all of the above methods do not introduce extra multiplications, so we reduce up to 95.8% of the energy consumption in linear layers during training.
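The power-of-two (PoT) idea behind ALS-POTQ can be sketched as rounding each weight to the nearest signed power of two, turning a multiply into a bit shift; the exponent range and function name are illustrative, and the adaptive layer-wise scaling is omitted.

```python
import math

def pot_quantize(w, min_exp=-8, max_exp=0):
    """Quantize a weight to the nearest signed power of two so that
    multiplication by it reduces to a shift (plus a sign flip).
    Clamping the exponent to [min_exp, max_exp] mimics a hardware
    shifter with a limited range."""
    if w == 0:
        return 0.0
    exp = min(max_exp, max(min_exp, round(math.log2(abs(w)))))
    return math.copysign(2.0 ** exp, w)
```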
arXiv Detail & Related papers (2023-02-28T10:05:45Z)
- Energy Transformer [64.22957136952725]
Our work combines aspects of three promising paradigms in machine learning, namely, attention mechanism, energy-based models, and associative memory.
We propose a novel architecture, called the Energy Transformer (or ET for short), that uses a sequence of attention layers that are purposely designed to minimize a specifically engineered energy function.
arXiv Detail & Related papers (2023-02-14T18:51:22Z)
- PhAST: Physics-Aware, Scalable, and Task-specific GNNs for Accelerated Catalyst Design [102.9593507372373]
Catalyst materials play a crucial role in the electrochemical reactions involved in industrial processes.
Machine learning holds the potential to efficiently model materials properties from large amounts of data.
We propose task-specific innovations applicable to most architectures, enhancing both computational efficiency and accuracy.
arXiv Detail & Related papers (2022-11-22T05:24:30Z)
- Energy Efficient Hardware Acceleration of Neural Networks with Power-of-Two Quantisation [0.0]
We show that a hardware neural network accelerator with PoT weights implemented on the Zynq UltraScale+ MPSoC ZCU104 FPGA can be at least 1.4x more energy efficient than the uniform quantisation version.
arXiv Detail & Related papers (2022-09-30T06:33:40Z)
- BiTAT: Neural Network Binarization with Task-dependent Aggregated Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation.
Extreme quantization (1-bit weight/1-bit activations) of compactly-designed backbone architectures results in severe performance degeneration.
This paper proposes a novel Quantization-Aware Training (QAT) method that can effectively alleviate performance degeneration.
arXiv Detail & Related papers (2022-07-04T13:25:49Z)
- Collaborative Intelligent Reflecting Surface Networks with Multi-Agent Reinforcement Learning [63.83425382922157]
Intelligent reflecting surface (IRS) is envisioned to be widely applied in future wireless networks.
In this paper, we investigate a multi-user communication system assisted by cooperative IRS devices with the capability of energy harvesting.
arXiv Detail & Related papers (2022-03-26T20:37:14Z)
- Energy-Efficient and Federated Meta-Learning via Projected Stochastic Gradient Ascent [79.58680275615752]
We propose an energy-efficient federated meta-learning framework.
We assume each task is owned by a separate agent, so only a limited number of tasks is available to train the meta-model.
arXiv Detail & Related papers (2021-05-31T08:15:44Z)
- Multi-Agent Meta-Reinforcement Learning for Self-Powered and Sustainable Edge Computing Systems [87.4519172058185]
An effective energy dispatch mechanism for self-powered wireless networks with edge computing capabilities is studied.
A novel multi-agent meta-reinforcement learning (MAMRL) framework is proposed to solve the formulated problem.
Experimental results show that the proposed MAMRL model can reduce non-renewable energy usage by up to 11% and energy cost by 22.4%.
arXiv Detail & Related papers (2020-02-20T04:58:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.