Related papers: Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices

Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices

URL: http://arxiv.org/abs/2504.08242v1
Date: Fri, 11 Apr 2025 03:58:59 GMT
Title: Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices
Authors: Shengyuan Ye, Bei Ouyang, Liekang Zeng, Tianyi Qian, Xiaowen Chu, Jian Tang, Xu Chen,
Abstract summary: Generative large language models (LLMs) have garnered significant attention due to their exceptional capabilities in various AI tasks.<n>The limited computational resources of individual edge devices can result in excessively prolonged inference latency and overwhelmed memory usage.<n>We propose Jupiter, a fast, scalable, and resource-efficient collaborative edge AI system for generative LLM inference.
Score: 21.520387382163978
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generative large language models (LLMs) have garnered significant attention due to their exceptional capabilities in various AI tasks. Traditionally deployed in cloud datacenters, LLMs are now increasingly moving towards more accessible edge platforms to protect sensitive user data and ensure privacy preservation. The limited computational resources of individual edge devices, however, can result in excessively prolonged inference latency and overwhelmed memory usage. While existing research has explored collaborative edge computing to break the resource wall of individual devices, these solutions yet suffer from massive communication overhead and under-utilization of edge resources. Furthermore, they focus exclusively on optimizing the prefill phase, neglecting the crucial autoregressive decoding phase for generative LLMs. To address that, we propose Jupiter, a fast, scalable, and resource-efficient collaborative edge AI system for generative LLM inference. Jupiter introduces a flexible pipelined architecture as a principle and differentiates its system design according to the differentiated characteristics of the prefill and decoding phases. For prefill phase, Jupiter submits a novel intra-sequence pipeline parallelism and develops a meticulous parallelism planning strategy to maximize resource efficiency; For decoding, Jupiter devises an effective outline-based pipeline parallel decoding mechanism combined with speculative decoding, which further magnifies inference acceleration. Extensive evaluation based on realistic implementation demonstrates that Jupiter remarkably outperforms state-of-the-art approaches under various edge environment setups, achieving up to 26.1x end-to-end latency reduction while rendering on-par generation quality.

Related papers

A Hybrid Swarm Intelligence Approach for Optimizing Multimodal Large Language Models Deployment in Edge-Cloud-based Federated Learning Environments [10.72166883797356]
Federated Learning (FL), Multimodal Large Language Models (MLLMs), and edge-cloud computing enables distributed and real-time data processing.<n>We propose a novel hybrid framework wherein MLLMs are deployed on edge devices equipped with sufficient resources and battery life, while the majority of training occurs in the cloud.<n>Our experimental results show that the proposed method significantly improves system performance, achieving an accuracy of 92%, reducing communication cost by 30%, and enhancing client participation.
arXiv Detail & Related papers (2025-02-04T03:03:24Z)
Activation Sparsity Opportunities for Compressing General Large Language Models [4.5624217435826]
This work systematically investigates the tradeoff between enforcing activation sparsity and perplexity (accuracy) on state-of-the-art AI models.<n>Our empirical analysis demonstrates that we can obtain around 50% of main memory and computing reductions for critical FFN components with negligible accuracy degradation.
arXiv Detail & Related papers (2024-12-13T02:26:54Z)
Split Federated Learning Over Heterogeneous Edge Devices: Algorithm and Optimization [7.013344179232109]
Split Learning (SL) is a promising collaborative machine learning approach, enabling resource-constrained devices to train models without sharing raw data. Current SL algorithms face limitations in training efficiency and suffer from prolonged latency. We propose the Heterogeneous Split Federated Learning framework, which allows resource-constrained clients to train their personalized client-side models in parallel.
arXiv Detail & Related papers (2024-11-21T07:46:01Z)
Pluto and Charon: A Time and Memory Efficient Collaborative Edge AI Framework for Personal LLMs Fine-Tuning [13.26886445965894]
Pluto and Charon (PAC) is a time and memory efficient collaborative edge AI framework for personal LLMs fine-tuning. PAC implements a personal LLMs fine-tuning technique that is efficient in terms of parameters, time, and memory. Extensive evaluation based on prototype implementation demonstrates that PAC remarkably outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2024-08-20T11:30:12Z)
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios. In the early route, intermediate outputs are consolidated via an anti-redundancy operation. In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner. Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z)
Edge Intelligence Optimization for Large Language Model Inference with Batching and Quantization [20.631476379056892]
Large Language Models (LLMs) are at the forefront of this movement. LLMs require cloud hosting, which raises issues regarding privacy, latency, and usage limitations. We present an edge intelligence optimization problem tailored for LLM inference.
arXiv Detail & Related papers (2024-05-12T02:38:58Z)
Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [166.40879020706151]
This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during fine-tuning. Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques. Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance.
arXiv Detail & Related papers (2024-02-18T14:08:48Z)
Dynamic Scheduling for Federated Edge Learning with Streaming Data [56.91063444859008]
We consider a Federated Edge Learning (FEEL) system where training data are randomly generated over time at a set of distributed edge devices with long-term energy constraints. Due to limited communication resources and latency requirements, only a subset of devices is scheduled for participating in the local training process in every iteration.
arXiv Detail & Related papers (2023-05-02T07:41:16Z)
Phase Retrieval using Expectation Consistent Signal Recovery Algorithm based on Hypernetwork [73.94896986868146]
Phase retrieval is an important component in modern computational imaging systems. Recent advances in deep learning have opened up a new possibility for robust and fast PR. We develop a novel framework for deep unfolding to overcome the existing limitations.
arXiv Detail & Related papers (2021-01-12T08:36:23Z)
Reconfigurable Intelligent Surface Assisted Mobile Edge Computing with Heterogeneous Learning Tasks [53.1636151439562]
Mobile edge computing (MEC) provides a natural platform for AI applications. We present an infrastructure to perform machine learning tasks at an MEC with the assistance of a reconfigurable intelligent surface (RIS) Specifically, we minimize the learning error of all participating users by jointly optimizing transmit power of mobile users, beamforming vectors of the base station, and the phase-shift matrix of the RIS.
arXiv Detail & Related papers (2020-12-25T07:08:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.