Generative AI on the Edge: Architecture and Performance Evaluation
- URL: http://arxiv.org/abs/2411.17712v1
- Date: Mon, 18 Nov 2024 16:09:01 GMT
- Title: Generative AI on the Edge: Architecture and Performance Evaluation
- Authors: Zeinab Nezami, Maryam Hafeez, Karim Djemame, Syed Ali Raza Zaidi
- Abstract summary: 6G's AI-native vision of embedding advanced intelligence in the network requires a systematic evaluation of Generative AI (GenAI) models on edge devices.
This research investigates computationally demanding Large Language Model (LLM) inference on a single commodity Raspberry Pi serving as an edge testbed for ORAN.
- Score: 0.3999851878220877
- Abstract: 6G's AI-native vision of embedding advanced intelligence in the network while bringing it closer to the user requires a systematic evaluation of Generative AI (GenAI) models on edge devices. Rapidly emerging solutions based on Open RAN (ORAN) and Network-in-a-Box strongly advocate the use of low-cost, off-the-shelf components for simpler and more efficient deployment, e.g., in provisioning rural connectivity. In this context, conceptual architectures, hardware testbeds and precise performance quantification of Large Language Models (LLMs) on off-the-shelf edge devices remain largely unexplored. This research investigates computationally demanding LLM inference on a single commodity Raspberry Pi serving as an edge testbed for ORAN. We investigate various LLMs, including small, medium and large models, on a Raspberry Pi 5 cluster using a lightweight Kubernetes distribution (K3s) with a modular prompting implementation. We study the feasibility and limitations of this setup by analyzing throughput, latency, accuracy and efficiency. Our findings indicate that CPU-only deployment of lightweight models, such as Yi, Phi, and Llama3, can effectively support edge applications, achieving a generation throughput of 5 to 12 tokens per second with less than 50% CPU and RAM usage. We conclude that GenAI on the edge offers localized inference in remote or bandwidth-constrained environments in 6G networks without reliance on cloud infrastructure.
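The abstract reports 5 to 12 tokens per second at under 50% CPU and RAM for CPU-only models such as Yi, Phi, and Llama3. As a rough illustration of how such figures can be gathered on a Pi, here is a minimal benchmarking sketch; it assumes the psutil package for resource sampling and uses a placeholder generate() function standing in for whatever local inference client (e.g., a llama.cpp binding) actually serves the model. It is not the paper's measurement harness.

```python
import time
import threading
import psutil  # third-party: pip install psutil


def generate(prompt: str) -> list[str]:
    """Placeholder for a local LLM call (e.g. a llama.cpp or Ollama client).
    Returns the generated tokens; replace with a real client."""
    return ("token " * 64).split()


def sample_resources(samples: list, stop: threading.Event, interval: float = 0.5) -> None:
    """Record CPU and RAM utilisation while generation runs."""
    while not stop.is_set():
        samples.append((psutil.cpu_percent(interval=interval),
                        psutil.virtual_memory().percent))


def benchmark(prompt: str) -> dict:
    samples: list = []
    stop = threading.Event()
    sampler = threading.Thread(target=sample_resources, args=(samples, stop))
    sampler.start()

    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start

    stop.set()
    sampler.join()
    peak_cpu = max(c for c, _ in samples) if samples else 0.0
    peak_ram = max(r for _, r in samples) if samples else 0.0
    return {"tokens_per_s": len(tokens) / elapsed,
            "peak_cpu_%": peak_cpu,
            "peak_ram_%": peak_ram}


if __name__ == "__main__":
    print(benchmark("Summarize the status of cell 7 in two sentences."))
```

In a setup like the paper's, generate() would wrap a quantized model served locally on the Raspberry Pi, and the harness would be run per model and prompt class to compare throughput against the resource ceilings above.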
Related papers
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
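The distributed-inference entry above develops load-balancing protocols that assign devices to maximize total system throughput. The paper's actual protocol is not reproduced here; the following toy sketch only illustrates the general idea of greedy shard-to-device assignment weighted by each device's measured speed, with all names and numbers invented.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    tokens_per_s: float          # measured single-shard speed (assumed input)
    assigned: list = None        # indices of model shards held by this device


def assign_shards(devices: list, num_shards: int) -> list:
    """Toy greedy load balancer: each shard goes to the device whose
    projected load (shards held / speed) would stay lowest."""
    for d in devices:
        d.assigned = []
    for shard in range(num_shards):
        target = min(devices, key=lambda d: (len(d.assigned) + 1) / d.tokens_per_s)
        target.assigned.append(shard)
    return devices


if __name__ == "__main__":
    cluster = [Device("pi5-a", 12.0), Device("pi5-b", 8.0), Device("pi5-c", 5.0)]
    for d in assign_shards(cluster, num_shards=10):
        print(d.name, d.assigned)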
- Programmable and Customized Intelligence for Traffic Steering in 5G Networks Using Open RAN Architectures [16.48682480842328]
5G and beyond mobile networks will support heterogeneous use cases at an unprecedented scale.
Such fine-grained control of the Radio Access Network (RAN) is not possible with the current cellular architecture.
We propose an open architecture with abstractions that enable closed-loop control and provide data-driven, intelligent optimization of the RAN at the user level.
arXiv Detail & Related papers (2022-09-28T15:31:06Z)
- MAPLE-Edge: A Runtime Latency Predictor for Edge Devices [80.01591186546793]
We propose MAPLE-Edge, an edge-device-oriented extension of MAPLE, the state-of-the-art latency predictor for general-purpose hardware.
Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters.
We also demonstrate that, unlike MAPLE, which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes.
arXiv Detail & Related papers (2022-04-27T14:00:48Z)
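MAPLE-Edge above predicts runtime latency from a small set of CPU performance counters. The feature set and regression model used in the paper are not reproduced here; the sketch below fits a plain least-squares model to synthetic counter data purely to illustrate the counters-to-latency mapping.

```python
import numpy as np

# Synthetic stand-ins for per-operator CPU performance counters
# (e.g. instructions, cache misses, branch misses) and measured latency.
rng = np.random.default_rng(0)
counters = rng.uniform(0.0, 1.0, size=(200, 3))       # 200 operators x 3 counters
true_w = np.array([4.0, 9.0, 2.0])
latency_ms = counters @ true_w + rng.normal(0, 0.1, 200)

# Fit a linear latency model: latency ~= counters @ w + b
X = np.hstack([counters, np.ones((200, 1))])           # add bias column
w, *_ = np.linalg.lstsq(X, latency_ms, rcond=None)

# Predict latency for a new operator's counter readings
new_op = np.array([0.3, 0.7, 0.1, 1.0])
print(f"predicted latency: {new_op @ w:.2f} ms")
```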
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) approach, Soft Actor-Critic for discrete actions (SAC-d), which selects the exit point and compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, such a scheme can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Communication-Efficient Separable Neural Network for Distributed Inference on Edge Devices [2.28438857884398]
We propose a novel method of exploiting model parallelism to separate a neural network for distributed inference.
Under proper specifications of devices and configurations of models, our experiments show that the inference of large neural networks on edge clusters can be distributed and accelerated.
arXiv Detail & Related papers (2021-11-03T19:30:28Z)
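The separable-network entry above partitions a model so its parts run on different edge devices and exchange only intermediate activations. A minimal numpy sketch of such a sequential split across two simulated devices follows; the partition point, layer sizes, and weights are arbitrary and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy 3-layer MLP, partitioned after layer 1: "device A" holds W1,
# "device B" holds W2 and W3; only the layer-1 activation crosses the link.
W1 = rng.normal(size=(32, 64))
W2 = rng.normal(size=(64, 64))
W3 = rng.normal(size=(64, 10))


def device_a(x: np.ndarray) -> np.ndarray:
    """First partition: computes and 'transmits' the intermediate activation."""
    return np.maximum(x @ W1, 0.0)


def device_b(h: np.ndarray) -> np.ndarray:
    """Second partition: finishes the forward pass on another device."""
    return np.maximum(h @ W2, 0.0) @ W3


x = rng.normal(size=(1, 32))
activation = device_a(x)          # the only data sent over the edge link
logits = device_b(activation)
print("communicated floats:", activation.size, "output shape:", logits.shape)
```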
- Computational Intelligence and Deep Learning for Next-Generation Edge-Enabled Industrial IoT [51.68933585002123]
We investigate how to deploy computational intelligence and deep learning (DL) in edge-enabled industrial IoT networks.
In this paper, we propose a novel multi-exit-based federated edge learning (ME-FEEL) framework.
In particular, the proposed ME-FEEL can achieve an accuracy gain of up to 32.7% in industrial IoT networks with severely limited resources.
arXiv Detail & Related papers (2021-10-28T08:14:57Z)
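ME-FEEL above attaches multiple exits to a model so that resource-limited nodes can stop computation early. The federated training procedure is not shown; the sketch below only illustrates confidence-thresholded early exit at inference time with randomly initialised toy classifiers.

```python
import numpy as np

rng = np.random.default_rng(2)


def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()


# Toy backbone blocks and per-exit classifiers (random weights, illustration only).
blocks = [rng.normal(size=(16, 16)) for _ in range(3)]
exits = [rng.normal(size=(16, 5)) for _ in range(3)]


def early_exit_infer(x: np.ndarray, threshold: float = 0.6):
    """Run blocks in order; stop at the first exit whose confidence clears the threshold."""
    h = x
    for i, (block, head) in enumerate(zip(blocks, exits)):
        h = np.tanh(h @ block)
        probs = softmax(h @ head)
        if probs.max() >= threshold or i == len(blocks) - 1:
            return i, int(probs.argmax()), float(probs.max())


exit_used, label, conf = early_exit_infer(rng.normal(size=16))
print(f"exit {exit_used}, class {label}, confidence {conf:.2f}")
```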
- Towards AIOps in Edge Computing Environments [60.27785717687999]
This paper describes the system design of an AIOps platform that is applicable in heterogeneous, distributed environments.
It is feasible to collect metrics with a high frequency and simultaneously run specific anomaly detection algorithms directly on edge devices.
arXiv Detail & Related papers (2021-02-12T09:33:00Z)
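The AIOps entry above observes that metrics can be collected at high frequency and anomaly detection run directly on edge devices. One simple way to do that is a rolling z-score detector over psutil samples, sketched below; the window size and threshold are arbitrary, and this is not the platform's algorithm.

```python
import time
from collections import deque

import psutil  # third-party: pip install psutil

WINDOW, THRESHOLD = 60, 3.0   # arbitrary: 60-sample window, 3-sigma rule


def rolling_zscore_anomalies(samples: int = 30, interval: float = 0.5) -> None:
    """Sample CPU utilisation and flag values far outside the recent window."""
    window = deque(maxlen=WINDOW)
    for _ in range(samples):
        value = psutil.cpu_percent(interval=interval)
        if len(window) >= 10:
            mean = sum(window) / len(window)
            var = sum((v - mean) ** 2 for v in window) / len(window)
            std = var ** 0.5 or 1.0
            if abs(value - mean) / std > THRESHOLD:
                print(f"anomaly: cpu={value:.1f}% (mean {mean:.1f}%, std {std:.1f})")
        window.append(value)


if __name__ == "__main__":
    rolling_zscore_anomalies()
```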
- Edge-Detect: Edge-centric Network Intrusion Detection using Deep Neural Network [0.0]
Edge nodes are crucial for detecting the multitude of cyber attacks on Internet-of-Things endpoints.
We develop a novel lightweight, fast and accurate 'Edge-Detect' model, which detects Denial of Service attacks on edge nodes using DLM techniques.
arXiv Detail & Related papers (2021-02-03T04:24:34Z)
- Cluster Pruning: An Efficient Filter Pruning Method for Edge AI Vision Applications [13.197955183748796]
A novel greedy approach called cluster pruning has been proposed, which provides a structured way of removing filters in a CNN.
A low-cost IoT hardware setup consisting of an Intel Movidius-NCS is proposed to deploy an edge-AI application using the proposed pruning methodology.
arXiv Detail & Related papers (2020-03-05T06:20:09Z)
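Cluster pruning above removes filters in a structured way by grouping similar ones. The paper's clustering criterion is not reproduced; the sketch below keeps one representative per cosine-similarity group over random filters, purely to show the shape of such a step.

```python
import numpy as np

rng = np.random.default_rng(3)


def cluster_prune(filters: np.ndarray, sim_threshold: float = 0.9) -> np.ndarray:
    """Greedy grouping: a filter joins an existing cluster if it is
    cosine-similar to that cluster's representative, otherwise it starts one.
    Only the representatives are kept (the pruned filter bank)."""
    flat = filters.reshape(filters.shape[0], -1)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    keep = []
    for i, f in enumerate(flat):
        if all(f @ flat[k] < sim_threshold for k in keep):
            keep.append(i)
    return filters[keep]


conv_filters = rng.normal(size=(64, 3, 3, 3))   # 64 filters of a 3x3x3 conv layer
pruned = cluster_prune(conv_filters)
print(f"kept {pruned.shape[0]} of {conv_filters.shape[0]} filters")
```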