PartialLoading: User Scheduling and Bandwidth Allocation for Parameter-sharing Edge Inference
- URL: http://arxiv.org/abs/2503.22982v1
- Date: Sat, 29 Mar 2025 05:58:07 GMT
- Title: PartialLoading: User Scheduling and Bandwidth Allocation for Parameter-sharing Edge Inference
- Authors: Guanqiao Qu, Qian Chen, Xianhao Chen, Kaibin Huang, Yuguang Fang
- Abstract summary: We develop a parameter-sharing AI model loading framework for multi-user edge inference. We exploit shared parameter blocks across models to maximize task throughput. We show that the proposed framework significantly improves task throughput under deadline constraints compared with user scheduling that does not exploit parameter sharing.
- Score: 32.58445942857626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: By provisioning inference offloading services, edge inference drives the rapid growth of AI applications at the network edge. However, achieving high task throughput with stringent latency requirements remains a significant challenge. To address this issue, we develop a parameter-sharing AI model loading (PartialLoading) framework for multi-user edge inference, which exploits two key insights: 1) the majority of latency arises from loading AI models into server GPU memory, and 2) different AI models can share a significant number of parameters, for which redundant loading should be avoided. Towards this end, we formulate a joint multi-user scheduling and spectrum bandwidth allocation problem to maximize task throughput by exploiting shared parameter blocks across models. The intuition is to judiciously schedule user requests to reuse the shared parameter blocks between consecutively loaded models, thereby reducing model loading time substantially. To facilitate solution finding, we decouple the problem into two sub-problems, i.e., user scheduling and bandwidth allocation, showing that solving them sequentially is equivalent to solving the original problem. Due to the NP-hardness of the problem, we first study an important special case called the "bottom-layer-sharing" case, where AI models share some bottom layers within clusters, and design a dynamic programming-based algorithm to obtain the optimal solution in polynomial time. For the general case, where shared parameter blocks appear at arbitrary positions within AI models, we propose a greedy heuristic to obtain the sub-optimal solution efficiently. Simulation results demonstrate that the proposed framework significantly improves task throughput under deadline constraints compared with user scheduling without exploiting parameter sharing.
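The scheduling intuition in the abstract lends itself to a short illustration. Below is a minimal Python sketch of the greedy idea behind the general case: order user requests so that consecutively loaded models reuse as many already-resident parameter blocks as possible. This is an assumption-laden toy, not the paper's algorithm; the block size, loading rate, and single-model residency policy are all hypothetical.

```python
# Illustrative sketch (not the paper's exact algorithm): greedily order user
# requests so that consecutively loaded models share as many parameter blocks
# as possible, reducing how many blocks must be (re)loaded into GPU memory.
# Block IDs, sizes, and the loading-rate model are hypothetical.

def greedy_schedule(requests, block_size_mb, load_rate_mb_s):
    """requests: dict mapping user -> set of parameter-block IDs its model needs.
    Returns a user order and the total model-loading time under that order."""
    remaining = dict(requests)
    order, loaded, total_time = [], set(), 0.0
    while remaining:
        # Pick the request whose model needs the fewest *new* blocks,
        # i.e., reuses the most blocks already resident in GPU memory.
        user = min(remaining, key=lambda u: len(remaining[u] - loaded))
        new_blocks = remaining.pop(user) - loaded
        total_time += len(new_blocks) * block_size_mb / load_rate_mb_s
        loaded = requests[user]  # assume only the current model stays resident
        order.append(user)
    return order, total_time

# Toy example: users 2 and 3 share bottom blocks {0, 1} with user 1,
# so scheduling them consecutively avoids redundant loading.
reqs = {1: {0, 1, 2}, 2: {0, 1, 3}, 3: {0, 1, 4}, 4: {5, 6}}
print(greedy_schedule(reqs, block_size_mb=512, load_rate_mb_s=2048))
```

Per the abstract, the actual framework additionally co-optimizes spectrum bandwidth allocation and, in the bottom-layer-sharing case, replaces a greedy step like this with a polynomial-time dynamic program.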
Related papers
- Privacy-Aware Joint DNN Model Deployment and Partition Optimization for Delay-Efficient Collaborative Edge Inference [14.408050197587654]
Edge inference (EI) is a key solution to address the growing challenges of delayed response times, limited scalability, and privacy concerns in cloud-based Deep Neural Network (DNN) inference. This paper proposes a novel framework for privacy-aware joint DNN model deployment and partition optimization to minimize long-term average inference delay under resource and privacy constraints.
arXiv Detail & Related papers (2025-02-22T05:27:24Z)
- Two-Timescale Model Caching and Resource Allocation for Edge-Enabled AI-Generated Content Services [55.0337199834612]
Generative AI (GenAI) has emerged as a transformative technology, enabling customized and personalized AI-generated content (AIGC) services.
These services require executing GenAI models with billions of parameters, posing significant obstacles to resource-limited wireless edge.
We introduce the formulation of joint model caching and resource allocation for AIGC services to balance a trade-off between AIGC quality and latency metrics.
arXiv Detail & Related papers (2024-11-03T07:01:13Z)
- TrimCaching: Parameter-sharing AI Model Caching in Wireless Edge Networks [36.39118138582416]
Next-generation mobile networks are expected to facilitate fast AI model downloading to end users.
By caching models on edge servers, mobile networks can deliver models to end users with low latency.
We develop a novel model placement scheme, called parameter-sharing model caching (TrimCaching).
arXiv Detail & Related papers (2024-05-07T04:08:49Z)
- A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both inference accuracy and mean squared error without requiring additional training data.
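As a concrete reference point, here is a minimal numpy sketch of the shared-backbone, multiple-prediction-head pattern the summary describes. The layer shapes, ReLU backbone, and head-averaging ensemble rule are illustrative assumptions, not the authors' architecture.

```python
import numpy as np

# Minimal sketch of a shared backbone with multiple prediction heads, as in
# ensemble multi-task designs like MEMTL. All shapes, the ReLU backbone, and
# the averaging ensemble rule are illustrative assumptions.

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, n_heads = 8, 16, 4, 3

W_shared = rng.normal(scale=0.1, size=(d_in, d_hidden))   # shared backbone
heads = [rng.normal(scale=0.1, size=(d_hidden, d_out)) for _ in range(n_heads)]

def forward(x):
    h = np.maximum(x @ W_shared, 0.0)          # one backbone pass, reused...
    outs = [h @ W_k for W_k in heads]          # ...by every prediction head
    return np.mean(outs, axis=0)               # simple ensemble: average heads

x = rng.normal(size=(2, d_in))                 # a batch of two feature vectors
print(forward(x).shape)                        # -> (2, 4)
```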
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
- TIES-Merging: Resolving Interference When Merging Models [95.59265307318752]
Transfer learning can confer significant advantages, including improved downstream performance, faster convergence, and better sample efficiency.
Model merging has emerged as a solution to combine multiple task-specific models into a single model without performing additional training.
Existing merging methods often ignore the interference between parameters of different models, resulting in large performance drops when merging multiple models.
We propose TIES-Merging, which introduces three novel steps when merging models: resetting parameters that only changed a small amount during fine-tuning, resolving sign conflicts, and merging only the parameters that are in alignment with the final agreed-upon sign.
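Since the three steps are spelled out, a compact numpy sketch can make them concrete. The trim fraction k and scaling factor lam are illustrative hyperparameters, and real checkpoints are dictionaries of tensors rather than a single vector.

```python
import numpy as np

# Compact numpy sketch of the three TIES-Merging steps described above
# (trim, elect sign, disjoint merge), applied to flattened parameter vectors.

def ties_merge(base, finetuned, k=0.2, lam=1.0):
    taus = [ft - base for ft in finetuned]            # task vectors
    # 1) Trim: keep only the top-k fraction of entries by magnitude.
    trimmed = []
    for tau in taus:
        thresh = np.quantile(np.abs(tau), 1.0 - k)
        trimmed.append(np.where(np.abs(tau) >= thresh, tau, 0.0))
    stacked = np.stack(trimmed)
    # 2) Elect sign: per-parameter majority sign across task vectors.
    elected = np.sign(stacked.sum(axis=0))
    # 3) Disjoint merge: average only entries agreeing with the elected sign.
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    counts = np.maximum(agree.sum(axis=0), 1)         # avoid division by zero
    merged = np.where(agree.any(axis=0),
                      stacked.sum(axis=0, where=agree) / counts, 0.0)
    return base + lam * merged

base = np.zeros(6)
models = [np.array([0.9, -0.8, 0.1, 0.0, 0.5, -0.4]),
          np.array([0.7, 0.6, 0.0, 0.1, -0.5, -0.6])]
print(ties_merge(base, models))
```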
arXiv Detail & Related papers (2023-06-02T17:31:32Z)
- Learning from Images: Proactive Caching with Parallel Convolutional Neural Networks [94.85780721466816]
A novel framework for proactive caching is proposed in this paper.
It combines model-based optimization with data-driven techniques by transforming an optimization problem into a grayscale image.
Numerical results show that the proposed scheme can reduce computation time by 71.6% with only 0.8% additional performance cost.
arXiv Detail & Related papers (2021-08-15T21:32:47Z)
- Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in wireless networks.
We consider the case of deep neural network (DNN) models, which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z)
- Artificial Intelligence Assisted Collaborative Edge Caching in Small Cell Networks [19.605382256630538]
This paper considers heterogeneous content preferences of users with heterogeneous caching models at the edge nodes.
We propose a modified particle swarm optimization (M-PSO) algorithm that efficiently solves the complex constraint problem in a reasonable time.
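The summary does not describe the specific M-PSO modifications, so the sketch below shows only a generic particle swarm optimizer with a quadratic penalty for constraint violations, as a baseline reference point rather than the authors' method.

```python
import numpy as np

# Generic particle swarm optimization with a simple penalty for constraint
# violations -- a baseline sketch only; the paper's M-PSO modifications are
# not described in the summary above.

def pso(objective, penalty, dim, n_particles=30, iters=200,
        w=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0), seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))    # positions
    v = np.zeros_like(x)                           # velocities
    fitness = lambda p: objective(p) + penalty(p)  # penalized objective
    pbest, pbest_val = x.copy(), np.array([fitness(p) for p in x])
    g = pbest[pbest_val.argmin()].copy()           # global best
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([fitness(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        g = pbest[pbest_val.argmin()].copy()
    return g

# Toy: minimize ||p||^2 subject to sum(p) >= 1, enforced via a penalty term.
obj = lambda p: float(p @ p)
pen = lambda p: 100.0 * max(0.0, 1.0 - p.sum()) ** 2
print(pso(obj, pen, dim=4))
```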
arXiv Detail & Related papers (2020-05-16T10:39:46Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)