A Parameterizable Convolution Accelerator for Embedded Deep Learning Applications
- URL: http://arxiv.org/abs/2602.04044v1
- Date: Tue, 03 Feb 2026 22:24:28 GMT
- Title: A Parameterizable Convolution Accelerator for Embedded Deep Learning Applications
- Authors: Panagiotis Mousouliotis, Georgios Keramidas,
- Abstract summary: Real-life embedded deep learning (DL) applications impose multiple constraints related to latency, power consumption, area, and cost.<n>This work presents a hardware-software (HW/SW) co-design methodology in which a CNN accelerator is described using high-level synthesis (HLS) tools.
- Score: 2.4923006485141284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional neural network (CNN) accelerators implemented on Field-Programmable Gate Arrays (FPGAs) are typically designed with a primary focus on maximizing performance, often measured in giga-operations per second (GOPS). However, real-life embedded deep learning (DL) applications impose multiple constraints related to latency, power consumption, area, and cost. This work presents a hardware-software (HW/SW) co-design methodology in which a CNN accelerator is described using high-level synthesis (HLS) tools that ease the parameterization of the design, facilitating more effective optimizations across multiple design constraints. Our experimental results demonstrate that the proposed design methodology is able to outperform non-parameterized design approaches, and it can be easily extended to other types of DL applications.
Related papers
- Implémentation Efficiente de Fonctions de Convolution sur FPGA à l'Aide de Blocs Paramétrables et d'Approximations Polynomiales [0.3966519779235704]
Implementing convolutional neural networks (CNNs) on field-programmable gate arrays (FPGAs) has emerged as a promising alternative to GPUs.<n>This paper proposes a library of convolution Blocks designed to optimize FPGA implementation and adapt to available resources.<n>It also presents a methodological framework for developing mathematical models that predict FPGA resources utilization.
arXiv Detail & Related papers (2025-10-03T15:58:20Z) - Deep Learning-based Techniques for Integrated Sensing and Communication Systems: State-of-the-Art, Challenges, and Opportunities [54.12860202362483]
This article comprehensively reviews recent developments and research on deep learning-based (DL-based) techniques for integrated sensing and communication (ISAC) systems.<n>ISAC is regarded as a key enabler for 6G and beyond networks, as many emerging applications, such as vehicular networks and industrial robotics, necessitate both sensing and communication capabilities.<n>As an alternative to conventional techniques, DL-based techniques offer efficient and near-optimal solutions with reduced computational complexity.
arXiv Detail & Related papers (2025-08-23T22:27:51Z) - MetaML-Pro: Cross-Stage Design Flow Automation for Efficient Deep Learning Acceleration [8.43012094714496]
This paper presents a unified framework for codifying and automating optimization strategies to deploy deep neural networks (DNNs) on resource-constrained hardware.<n>Our novel approach addresses two key issues: (i)encoding custom optimization strategies and (ii)enabling cross-stage optimization search.
arXiv Detail & Related papers (2025-02-09T11:02:06Z) - DCP: Learning Accelerator Dataflow for Neural Network via Propagation [52.06154296196845]
This work proposes an efficient data-centric approach, named Dataflow Code Propagation (DCP), to automatically find the optimal dataflow for DNN layers in seconds without human effort.
DCP learns a neural predictor to efficiently update the dataflow codes towards the desired gradient directions to minimize various optimization objectives.
For example, without using additional training data, DCP surpasses the GAMMA method that performs a full search using thousands of samples.
arXiv Detail & Related papers (2024-10-09T05:16:44Z) - Reconfigurable Distributed FPGA Cluster Design for Deep Learning
Accelerators [59.11160990637615]
We propose a distributed system based on lowpower embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - A Graph Deep Learning Framework for High-Level Synthesis Design Space
Exploration [11.154086943903696]
High-Level Synthesis is a solution for fast prototyping application-specific hardware.
We propose HLS, for the first time in the literature, graph neural networks that jointly predict acceleration performance and hardware costs.
We show that our approach achieves prediction accuracy comparable with that of commonly used simulators.
arXiv Detail & Related papers (2021-11-29T18:17:45Z) - A Construction Kit for Efficient Low Power Neural Network Accelerator
Designs [11.807678100385164]
This work provides a survey of neural network accelerator optimization approaches that have been used in recent works.
It presents the list of optimizations and their quantitative effects as a construction kit, allowing to assess the design choices for each building block separately.
arXiv Detail & Related papers (2021-06-24T07:53:56Z) - Ps and Qs: Quantization-aware pruning for efficient low latency neural
network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
arXiv Detail & Related papers (2021-02-22T19:00:05Z) - Scalable Deep-Learning-Accelerated Topology Optimization for Additively
Manufactured Materials [4.221095652322005]
Topology optimization (TO) is a popular and powerful computational approach for designing novel structures, materials, and devices.
To address these issues, we propose a general scalable deep-learning (DL) based TO framework, referred to as SDL-TO.
Our framework accelerates TO by learning the iterative history data and simultaneously training on the mapping between the given design and its gradient.
arXiv Detail & Related papers (2020-11-28T17:38:31Z) - Adaptive pruning-based optimization of parameterized quantum circuits [62.997667081978825]
Variisy hybrid quantum-classical algorithms are powerful tools to maximize the use of Noisy Intermediate Scale Quantum devices.
We propose a strategy for such ansatze used in variational quantum algorithms, which we call "Efficient Circuit Training" (PECT)
Instead of optimizing all of the ansatz parameters at once, PECT launches a sequence of variational algorithms.
arXiv Detail & Related papers (2020-10-01T18:14:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.