Hardware-Aware Reformulation of Convolutions for Efficient Execution on Specialized AI Hardware: A Case Study on NVIDIA Tensor Cores
- URL: http://arxiv.org/abs/2601.11608v1
- Date: Fri, 09 Jan 2026 02:28:14 GMT
- Title: Hardware-Aware Reformulation of Convolutions for Efficient Execution on Specialized AI Hardware: A Case Study on NVIDIA Tensor Cores
- Authors: Ganesh Bikshandi
- Abstract summary: NVIDIA Tensor Cores, for instance, require input channels to be multiples of 8 and sometimes 512 for efficient execution. Traditional approaches address such alignment issues using zero-padding, which can be inefficient. We present a first-step, hardware-aware reformulation of CNN computations using rewrite rules.
- Score: 0.6138671548064355
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Convolutional Neural Networks (CNNs) are central to modern AI, but their performance is often limited by hardware constraints. NVIDIA Tensor Cores, for instance, require input channels to be multiples of 8 and sometimes 512 for efficient execution. The oneDNN framework for CPUs imposes a similar requirement for its blocked format. Traditional approaches address such alignment issues using zero-padding, which can be inefficient. In this work, we present a first-step, hardware-aware reformulation of CNN computations using rewrite rules, restructuring the underlying math to satisfy hardware alignment entirely post-training, without modifying network weights. While our current implementation focuses on a single transformation for Tensor Cores, this approach is generalizable, laying the foundation to explore additional transformations for CPUs and accelerators. This study represents an initial step toward semantic tuning, a systematic, hardware-aware optimization strategy for efficient deployment of CNN models on specialized AI hardware.
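As a rough illustration of the zero-padding baseline the abstract calls inefficient (not the paper's rewrite rule, which the abstract does not spell out), the sketch below pads a hypothetical 3-channel input and its filters up to 8 channels and checks that a 1x1 convolution gives the same result. All names and values here are made up for the example; the multiple of 8 is the alignment figure quoted in the abstract.

```python
def round_up(n, multiple=8):
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


def pad_channels(vec, multiple=8):
    """Append zeros so the channel count becomes a multiple of `multiple`."""
    return vec + [0] * (round_up(len(vec), multiple) - len(vec))


def conv1x1(pixel, filters):
    """1x1 convolution at one pixel: one dot product over input channels per filter."""
    return [sum(w * x for w, x in zip(f, pixel)) for f in filters]


# Hypothetical layer: 3 input channels (e.g. RGB), 2 output channels.
pixel = [1, 2, 3]
filters = [[4, 5, 6], [-1, 0, 2]]

# Unpadded reference result.
reference = conv1x1(pixel, filters)

# Zero-padded version: input and every filter grow from 3 to 8 channels.
padded = conv1x1(pad_channels(pixel), [pad_channels(f) for f in filters])

# Fraction of the padded multiply-accumulates that are spent on zeros.
wasted = 1 - len(pixel) / round_up(len(pixel))
```

Padding leaves the output unchanged because the added channels contribute only zero products, but in this toy case 5 of every 8 multiply-accumulates do no useful work, which is the kind of overhead a reformulation would aim to avoid.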
Related papers
- FlashRNN: I/O-Aware Optimization of Traditional RNNs on modern hardware [6.749483762719583]
State-tracking capabilities are important for time-series tasks and logical reasoning. Traditional RNNs like LSTMs and GRUs do have these capabilities at the cost of strictly sequential processing. We show how fast these networks can get with our hardware-optimization FlashRNN in Triton and optimized kernels to the register level.
arXiv Detail & Related papers (2024-12-10T18:50:37Z)
- Tricking AI chips into Simulating the Human Brain: A Detailed Performance Analysis [0.5354801701968198]
We evaluate multiple, cutting-edge AI-chips (Graphcore IPU, GroqChip, Nvidia GPU with Tensor Cores and Google TPU) for brain simulation.
Our performance analysis reveals that the simulation problem maps extremely well onto the GPU and TPU architectures.
The GroqChip outperforms both platforms for small networks but, due to implementing some floating-point operations at reduced accuracy, is found not yet usable for brain simulation.
arXiv Detail & Related papers (2023-01-31T13:51:37Z) - Training Integer-Only Deep Recurrent Neural Networks [3.1829446824051195]
We present a quantization-aware training method for obtaining a highly accurate integer-only recurrent neural network (iRNN)
Our approach supports layer normalization, attention, and an adaptive piecewise linear (PWL) approximation of activation functions.
The proposed method enables RNN-based language models to run on edge devices with $2times$ improvement in runtime.
arXiv Detail & Related papers (2022-12-22T15:22:36Z) - Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z) - AEGNN: Asynchronous Event-based Graph Neural Networks [54.528926463775946]
Event-based Graph Neural Networks generalize standard GNNs to process events as "evolving"-temporal graphs.
AEGNNs are easily trained on synchronous inputs and can be converted to efficient, "asynchronous" networks at test time.
arXiv Detail & Related papers (2022-03-31T16:21:12Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the emphexit point, emphexit point, and emphcompressing bits by soft policy iterations.
Based on the latency and accuracy aware reward design, such an computation can well adapt to the complex environment like dynamic wireless channel and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - A Long Short-Term Memory for AI Applications in Spike-based Neuromorphic
Hardware [1.3381749415517017]
It has an open problem to what extent current Artificial Intelligence (AI) methods that employ Deep Neural Networks (DNNs) can be implemented more energy-efficiently on spike-based neuromorphic hardware.
We present a biologically inspired solution that solves this problem.
This solution enables us to implement a major class of DNNs for sequence processing tasks such as time series classification and question answering with substantial energy savings on neuromorphic hardware.
arXiv Detail & Related papers (2021-07-08T17:37:02Z) - Content-Aware Convolutional Neural Networks [98.97634685964819]
Convolutional Neural Networks (CNNs) have achieved great success due to the powerful feature learning ability of convolution layers.
We propose a Content-aware Convolution (CAC) that automatically detects the smooth windows and applies a 1x1 convolutional kernel to replace the original large kernel.
arXiv Detail & Related papers (2021-06-30T03:54:35Z) - Quantized Neural Networks via {-1, +1} Encoding Decomposition and
Acceleration [83.84684675841167]
We propose a novel encoding scheme using -1, +1 to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z) - cuConv: A CUDA Implementation of Convolution for CNN Inference [0.0]
Convolutions are the core operation of deep learning applications based on Convolutional Neural Networks (CNNs)
We propose a GPU-based implementation of the convolution operation for CNN inference that favors coalesced accesses, without requiring prior data transformations.
Our experiments demonstrate that our proposal yields notable performance improvements in a range of common CNN forward propagation convolution configurations.
arXiv Detail & Related papers (2021-03-30T10:33:53Z) - GradInit: Learning to Initialize Neural Networks for Stable and
Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture method for initializing neural networks.
It is based on a simple agnostic; the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z) - Binary Graph Neural Networks [69.51765073772226]
Graph Neural Networks (GNNs) have emerged as a powerful and flexible framework for representation learning on irregular data.
In this paper, we present and evaluate different strategies for the binarization of graph neural networks.
We show that through careful design of the models, and control of the training process, binary graph neural networks can be trained at only a moderate cost in accuracy on challenging benchmarks.
arXiv Detail & Related papers (2020-12-31T18:48:58Z) - Efficient Computation Reduction in Bayesian Neural Networks Through
Feature Decomposition and Memorization [10.182119276564643]
In this paper, an efficient BNN inference flow is proposed to reduce the computation cost.
About half of the computations could be eliminated compared to the traditional approach.
We implement our approach in Verilog and synthesise it with 45 nm FreePDK technology.
arXiv Detail & Related papers (2020-05-08T05:03:04Z)