On Accelerating Edge AI: Optimizing Resource-Constrained Environments
- URL: http://arxiv.org/abs/2501.15014v2
- Date: Tue, 28 Jan 2025 20:29:44 GMT
- Title: On Accelerating Edge AI: Optimizing Resource-Constrained Environments
- Authors: Jacob Sander, Achraf Cohen, Venkat R. Dasari, Brent Venable, Brian Jalaian
- Abstract summary: Resource-constrained edge deployments demand AI solutions that balance high performance with stringent compute, memory, and energy limitations.
We present a comprehensive overview of the primary strategies for accelerating deep learning models under such constraints.
- Abstract: Resource-constrained edge deployments demand AI solutions that balance high performance with stringent compute, memory, and energy limitations. In this survey, we present a comprehensive overview of the primary strategies for accelerating deep learning models under such constraints. First, we examine model compression techniques (pruning, quantization, tensor decomposition, and knowledge distillation) that streamline large models into smaller, faster, and more efficient variants. Next, we explore Neural Architecture Search (NAS), a class of automated methods that discover architectures inherently optimized for particular tasks and hardware budgets. We then discuss compiler and deployment frameworks, such as TVM, TensorRT, and OpenVINO, which provide hardware-tailored optimizations at inference time. By integrating these three pillars into unified pipelines, practitioners can achieve multi-objective goals, including latency reduction, memory savings, and energy efficiency, all while maintaining competitive accuracy. We also highlight emerging frontiers in hierarchical NAS, neurosymbolic approaches, and advanced distillation tailored to large language models, underscoring open challenges like pre-training pruning for massive networks. Our survey offers practical insights, identifies current research gaps, and outlines promising directions for building scalable, platform-independent frameworks to accelerate deep learning models at the edge.
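To ground the compression pillar, the sketch below combines two of the surveyed techniques, unstructured magnitude pruning and post-training dynamic quantization, in PyTorch. The toy architecture and the 50% sparsity level are assumptions chosen for illustration, not settings from the paper.

    # Minimal sketch: magnitude pruning + dynamic quantization (illustrative).
    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    # Toy model standing in for a larger edge workload (assumed).
    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

    # Unstructured magnitude pruning: zero out the 50% smallest weights per layer.
    for module in model:
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.5)
            prune.remove(module, "weight")  # bake the sparsity into the weight tensor

    # Post-training dynamic quantization: int8 weights for Linear layers.
    quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    x = torch.randn(1, 128)
    print(quantized(x).shape)  # torch.Size([1, 10])

Dynamic quantization stores Linear weights in int8, roughly quartering their size; realizing latency gains from the induced sparsity on edge hardware generally requires structured pruning or sparse kernels.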
Related papers
- Low-Rank Adapters Meet Neural Architecture Search for LLM Compression [1.8434042562191815]
The rapid expansion of Large Language Models (LLMs) has posed significant challenges regarding the computational resources required for fine-tuning and deployment.
Recent advancements in low-rank adapters have demonstrated their efficacy in parameter-efficient fine-tuning (PEFT) of these models.
This paper comprehensively discusses innovative approaches that combine low-rank representations with Neural Architecture Search (NAS) techniques.
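As a rough sketch of the low-rank adapter idea the paper builds on (a generic LoRA-style layer, not the paper's specific method), the pretrained weight is frozen and a trainable rank-r update, scaled by alpha/r, is added in parallel:

    # Generic LoRA-style adapter: y = W x + (alpha / r) * B(A(x)).
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)  # freeze the pretrained weights
            self.A = nn.Linear(base.in_features, r, bias=False)   # down-projection
            self.B = nn.Linear(r, base.out_features, bias=False)  # up-projection
            nn.init.zeros_(self.B.weight)  # adapter starts as a zero update
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * self.B(self.A(x))

    layer = LoRALinear(nn.Linear(512, 512))
    print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])

A NAS layer on top of this would then search over choices such as the rank r and which layers receive adapters.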
arXiv Detail & Related papers (2025-01-23T02:14:08Z)
- A Survey on Inference Optimization Techniques for Mixture of Experts Models [50.40325411764262]
Large-scale Mixture of Experts (MoE) models offer enhanced model capacity and computational efficiency through conditional computation.
However, deploying and running inference on these models presents significant challenges in computational resources, latency, and energy efficiency.
This survey analyzes optimization techniques for MoE models across the entire system stack.
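The conditional computation at the heart of MoE can be sketched as below; the expert count, top-2 routing, and dense per-expert loop are illustrative assumptions (real serving systems batch tokens by expert and shard experts across devices):

    # Toy top-2 MoE layer: each token is routed to its 2 highest-scoring experts.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyMoE(nn.Module):
        def __init__(self, dim=64, num_experts=4, k=2):
            super().__init__()
            self.k = k
            self.gate = nn.Linear(dim, num_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                for _ in range(num_experts)
            )

        def forward(self, x):                       # x: (tokens, dim)
            scores = self.gate(x)                   # (tokens, num_experts)
            weights, idx = scores.topk(self.k, dim=-1)
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e
                    if mask.any():
                        w = weights[mask, slot].unsqueeze(1)
                        out[mask] += w * expert(x[mask])
            return out

    moe = TinyMoE()
    print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])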
arXiv Detail & Related papers (2024-12-18T14:11:15Z)
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses the bandwidth and latency limits of centralized cloud processing by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
- Inference Optimization of Foundation Models on AI Accelerators [68.24450520773688]
Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI.
As the number of model parameters reaches hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios.
This tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators.
arXiv Detail & Related papers (2024-07-12T09:24:34Z)
- Machine Learning Insides OptVerse AI Solver: Design Principles and Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances, utilizing generative models that mirror the multifaceted structures of real-world problems.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z)
- Neural Architecture Codesign for Fast Bragg Peak Analysis [1.7081438846690533]
We develop an automated pipeline to streamline neural architecture codesign for fast, real-time Bragg peak analysis in microscopy.
Our method employs neural architecture search and AutoML to enhance these models while accounting for hardware costs, leading to the discovery of more hardware-efficient neural architectures.
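Hardware-aware codesign of this kind is often posed as a scalarized multi-objective search. The toy scoring function below is our assumption for illustration, not the paper's objective; it penalizes parameter count and latency-budget violations when ranking candidate architectures:

    # Hypothetical hardware-aware NAS objective: trade accuracy against cost.
    def candidate_score(accuracy: float, latency_ms: float, params_m: float,
                        lat_budget_ms: float = 5.0, lam: float = 0.1) -> float:
        """Higher is better; exceeding the latency budget is penalized."""
        penalty = max(0.0, latency_ms - lat_budget_ms) / lat_budget_ms
        return accuracy - lam * params_m - penalty

    # Rank hypothetical candidates: (accuracy, latency in ms, params in millions).
    candidates = {"A": (0.91, 4.2, 3.1), "B": (0.93, 7.8, 5.4), "C": (0.90, 3.0, 1.2)}
    best = max(candidates, key=lambda k: candidate_score(*candidates[k]))
    print(best)  # "C" wins under these assumed weights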
arXiv Detail & Related papers (2023-12-10T19:42:18Z)
- Complexity-Driven CNN Compression for Resource-constrained Edge AI [1.6114012813668934]
We propose a novel and computationally efficient pruning pipeline by exploiting the inherent layer-level complexities of CNNs.
We define three modes of pruning, namely parameter-aware (PA), FLOPs-aware (FA), and memory-aware (MA), to introduce versatile compression of CNNs.
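A hedged sketch of how such modes might allocate sparsity (the paper's actual pipeline differs): distribute a global pruning budget across layers in proportion to each layer's share of the chosen complexity measure.

    # Hypothetical complexity-driven ratio allocation for PA / FA / MA modes.
    def allocate_ratios(layers, mode="PA", global_ratio=0.5):
        """layers: list of dicts with per-layer 'params', 'flops', and 'mem'."""
        key = {"PA": "params", "FA": "flops", "MA": "mem"}[mode]
        total = sum(layer[key] for layer in layers)
        # Layers contributing more to the chosen cost are pruned more aggressively.
        return [min(0.9, global_ratio * len(layers) * layer[key] / total)
                for layer in layers]

    layers = [
        {"params": 1e5, "flops": 5e8, "mem": 2e6},
        {"params": 4e5, "flops": 2e8, "mem": 8e6},
        {"params": 5e5, "flops": 3e8, "mem": 6e6},
    ]
    print(allocate_ratios(layers, mode="FA"))  # prune fractions ~[0.75, 0.3, 0.45]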
arXiv Detail & Related papers (2022-08-26T16:01:23Z)
- FreeREA: Training-Free Evolution-based Architecture Search [17.202375422110553]
FreeREA is a custom cell-based evolution NAS algorithm that exploits an optimised combination of training-free metrics to rank architectures.
Our experiments, carried out on the common benchmarks NAS-Bench-101 and NATS-Bench, demonstrate that FreeREA is a fast, efficient, and effective search method for automatic model design.
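Such training-free metrics need only a single forward pass at initialization. The sketch below implements a NASWOT-style activation-pattern score as one illustrative proxy; FreeREA's exact metric combination differs:

    # NASWOT-style training-free proxy: log-determinant of the ReLU
    # activation-pattern agreement kernel on one random minibatch.
    import torch
    import torch.nn as nn

    def naswot_score(model: nn.Module, x: torch.Tensor) -> float:
        codes = []
        hooks = [m.register_forward_hook(
                     lambda _m, _inp, out: codes.append((out > 0).flatten(1).float()))
                 for m in model.modules() if isinstance(m, nn.ReLU)]
        with torch.no_grad():
            model(x)
        for h in hooks:
            h.remove()
        c = torch.cat(codes, dim=1)            # binary activation code per input
        k = c @ c.T + (1 - c) @ (1 - c).T      # pattern-agreement kernel
        return torch.logdet(k + 1e-3 * torch.eye(len(x))).item()

    net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
    print(naswot_score(net, torch.randn(16, 32)))  # higher = more expressive at init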
arXiv Detail & Related papers (2022-06-17T11:16:28Z)
- Top-KAST: Top-K Always Sparse Training [50.05611544535801]
We propose Top-KAST, a method that preserves constant sparsity throughout training.
We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark.
In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
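The core forward-pass idea can be sketched as follows; this is simplified, since Top-KAST also maintains a larger exploration set that receives gradients, which the toy layer omits:

    # Simplified Top-K sparse forward pass in the spirit of Top-KAST.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKLinear(nn.Linear):
        def __init__(self, in_features, out_features, density=0.1):
            super().__init__(in_features, out_features)
            self.density = density

        def forward(self, x):
            n = self.weight.numel()
            k = max(1, int(n * self.density))
            # Keep only the k largest-magnitude weights in the forward pass.
            threshold = self.weight.abs().flatten().kthvalue(n - k + 1).values
            mask = (self.weight.abs() >= threshold).float()
            return F.linear(x, self.weight * mask, self.bias)

    layer = TopKLinear(128, 64, density=0.05)
    print(layer(torch.randn(4, 128)).shape)  # torch.Size([4, 64])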
arXiv Detail & Related papers (2021-06-07T11:13:05Z)
- MS-RANAS: Multi-Scale Resource-Aware Neural Architecture Search [94.80212602202518]
We propose Multi-Scale Resource-Aware Neural Architecture Search (MS-RANAS).
We employ a one-shot architecture search approach to reduce the search cost.
We achieve state-of-the-art results in terms of accuracy-speed trade-off.
arXiv Detail & Related papers (2020-09-29T11:56:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and accepts no responsibility for any consequences of its use.