HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network
- URL: http://arxiv.org/abs/2601.11676v1
- Date: Fri, 16 Jan 2026 07:37:23 GMT
- Title: HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network
- Authors: Peirong Zheng, Wenchao Xu, Haozhao Wang, Jinyu Chen, Xuemin Shen,
- Abstract summary: Large language model (LLM) inference at the edge can facilitate prompt service responsiveness while protecting user privacy. We propose HALO, a novel framework that boosts distributed LLM inference in lossy edge networks. Experimental results from a Raspberry Pi cluster demonstrate that HALO achieves a 3.41x end-to-end speedup for LLaMA-series LLMs under unreliable network conditions.
- Score: 50.33808558714122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deploying large language model (LLM) inference at the edge can facilitate prompt service responsiveness while protecting user privacy. However, it is critically challenged by the resource constraints of a single edge node. Distributed inference has emerged to aggregate and leverage computational resources across multiple devices. Yet, existing methods typically require strict synchronization, which is often infeasible under unreliable network conditions. In this paper, we propose HALO, a novel framework that boosts distributed LLM inference in lossy edge networks. The core idea is to enable relaxed yet effective synchronization by strategically allocating less critical neuron groups to unstable devices, thus avoiding the excessive waiting time incurred by delayed packets. HALO introduces three key mechanisms: (1) a semantic-aware predictor that assesses the significance of neuron groups prior to activation; (2) a parallel execution scheme for neuron group loading during model inference; (3) a load-balancing scheduler that efficiently orchestrates multiple devices with heterogeneous resources. Experimental results from a Raspberry Pi cluster demonstrate that HALO achieves a 3.41x end-to-end speedup for LLaMA-series LLMs under unreliable network conditions. It maintains performance comparable to that under ideal conditions and significantly outperforms the state of the art in various scenarios.
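The allocation idea behind mechanisms (1) and (3) can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function name, the importance scores, and the reliability-ranked round-robin policy are all assumptions.

```python
# Hypothetical sketch of HALO's core allocation idea: place less critical
# neuron groups on less reliable devices, so that delayed or dropped packets
# mostly affect low-importance activations. Scoring and policy are illustrative.

def allocate_groups(group_importance, device_reliability):
    """Map neuron-group index -> device index.

    group_importance: importance score per neuron group (e.g. from a predictor).
    device_reliability: packet-delivery probability per device.
    """
    # Rank groups from most to least important.
    groups = sorted(range(len(group_importance)),
                    key=lambda g: group_importance[g], reverse=True)
    # Rank devices from most to least reliable.
    devices = sorted(range(len(device_reliability)),
                     key=lambda d: device_reliability[d], reverse=True)
    # Round-robin over devices in reliability order, so the most important
    # groups land on the most stable links first.
    assignment = {}
    for i, g in enumerate(groups):
        assignment[g] = devices[i % len(devices)]
    return assignment

# Example: 4 neuron groups, 2 devices (device 0 has the more stable link).
mapping = allocate_groups([0.9, 0.1, 0.5, 0.3], [0.99, 0.80])
```

Under this toy policy, the two most important groups (indices 0 and 2) are pinned to the devices in reliability order, so a lost packet from the weaker device only stalls lower-importance groups.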
Related papers
- Wireless Federated Multi-Task LLM Fine-Tuning via Sparse-and-Orthogonal LoRA [61.12136997430116]
Decentralized federated learning (DFL) based on low-rank adaptation (LoRA) enables mobile devices with multi-task datasets to collaboratively fine-tune a large language model (LLM) by exchanging locally updated parameters with a subset of neighboring devices via wireless connections for knowledge integration. However, directly aggregating parameters fine-tuned on heterogeneous datasets induces three primary issues across the DFL life-cycle: (i) catastrophic knowledge forgetting during the fine-tuning process, arising from conflicting update directions caused by data heterogeneity; (ii) inefficient communication and convergence during the model aggregation process,
arXiv Detail & Related papers (2026-02-24T02:45:32Z) - ELSA: Efficient LLM-Centric Split Aggregation for Privacy-Aware Hierarchical Federated Learning over Resource-Constrained Edge Networks [22.53431546014934]
Training large language models (LLMs) at the network edge faces fundamental challenges arising from device resource constraints, severe data heterogeneity, and heightened privacy risks. We propose ELSA, a novel framework that integrates split learning (SL) and hierarchical federated learning (HFL) for distributed LLM fine-tuning over resource-constrained edge networks. First, it employs a task-agnostic, behavior-aware client clustering mechanism that constructs semantic fingerprints using public probe inputs and symmetric KL divergence. Second, it splits the LLM into three parts across clients and edge servers, with the cloud used only for adapter aggregation. Third, it
arXiv Detail & Related papers (2026-01-20T10:33:19Z) - Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding [6.130486652666936]
Speculative decoding offers a promising solution by partitioning token generation between a lightweight draft model on mobile devices and a powerful target model on edge servers. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation to support efficient parallel speculative decoding. Results show that our method achieves up to a 28.0% (average 23.7%) reduction in end-to-end latency without compromising inference accuracy.
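The draft/target split described above can be sketched with a minimal greedy speculative-decoding step. This is an illustrative toy, not the paper's system: the function names and the deterministic stand-in models are assumptions.

```python
# Minimal sketch of one speculative-decoding step: a cheap "draft" model
# proposes k tokens, the expensive "target" model verifies them and keeps
# the longest agreeing prefix, then emits one token of its own.

def speculative_step(draft_next, target_next, prefix, k):
    """draft_next / target_next: callables mapping a token sequence to the
    next token (greedy, deterministic stand-ins for real models)."""
    # Draft model proposes k tokens autoregressively.
    proposed = []
    seq = list(prefix)
    for _ in range(k):
        t = draft_next(seq)
        proposed.append(t)
        seq.append(t)
    # Target model verifies each proposed token; accept until first mismatch.
    accepted = []
    seq = list(prefix)
    for t in proposed:
        if target_next(seq) == t:
            accepted.append(t)
            seq.append(t)
        else:
            break
    # Always emit one token from the target, so the step makes progress
    # even when every draft token is rejected.
    accepted.append(target_next(seq))
    return accepted

# Toy models: the draft increments the last token; the target does the same
# but caps values at 3, so the models diverge after token value 3.
draft = lambda s: s[-1] + 1
target = lambda s: min(s[-1] + 1, 3)
out = speculative_step(draft, target, [0], k=4)
```

In the mobile/edge setting, `draft_next` would run on the device and `target_next` on the edge server, so one round trip verifies up to k tokens instead of one.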
arXiv Detail & Related papers (2025-11-03T16:04:44Z) - GLASS: Test-Time Acceleration for LLMs via Global-Local Neural Importance Aggregation [12.921040231832082]
We introduce A/I-GLASS: Activation- and Impact-based Global-Local neural importance aggregation for feed-forward network SparSification. Empirical results across multiple large language models (LLMs) and benchmarks demonstrate that GLASS significantly outperforms prior training-free methods.
arXiv Detail & Related papers (2025-08-19T22:50:20Z) - CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems [62.24576366776727]
We propose a latency-aware scheduling framework to minimize total inference latency. We show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
arXiv Detail & Related papers (2025-08-15T07:49:22Z) - ReinDSplit: Reinforced Dynamic Split Learning for Pest Recognition in Precision Agriculture [13.00865517063611]
We introduce ReinDSplit, a reinforcement learning framework that dynamically tailors split points for each device. A Q-learning agent acts as an adaptive orchestrator, balancing workloads and latency thresholds across devices. We evaluate ReinDSplit on three insect classification datasets using ResNet18, GoogleNet, and MobileNetV2.
arXiv Detail & Related papers (2025-06-16T19:18:56Z) - The Larger the Merrier? Efficient Large AI Model Inference in Wireless Edge Networks [56.37880529653111]
The demand for large AI model (LAIM) services is driving a paradigm shift from traditional cloud-based inference to edge-based inference for low-latency, privacy-preserving applications. In this paper, we investigate the LAIM inference scheme, where a pre-trained LAIM is pruned and partitioned into on-device and on-server sub-models for deployment.
arXiv Detail & Related papers (2025-05-14T08:18:55Z) - Split Learning in Computer Vision for Semantic Segmentation Delay Minimization [25.0679083637967]
We propose a novel approach to minimize the inference delay in semantic segmentation using split learning (SL). SL is tailored to the needs of real-time computer vision (CV) applications for resource-constrained devices.
arXiv Detail & Related papers (2024-12-18T19:07:25Z) - Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates [71.81037644563217]
Synchronous federated learning (FL) is a popular paradigm for collaborative edge learning.
As some of the devices may have limited computational resources and varying availability, FL latency is highly sensitive to stragglers.
We propose straggler-aware layer-wise federated learning (SALF) that leverages the optimization procedure of NNs via backpropagation to update the global model in a layer-wise fashion.
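The layer-wise idea can be sketched as follows. This is an illustrative assumption about the mechanism, not SALF's actual algorithm: each layer of the global model is averaged over only the devices whose partial backpropagation reached that layer, so stragglers still contribute to the layers they finished.

```python
# Illustrative sketch of layer-wise aggregation with stragglers: each device
# reports gradients only for the layers it completed, and the server averages
# each layer over whichever devices reached it. Gradients are scalars here
# for simplicity; real updates would be tensors.

def aggregate_layerwise(updates, num_layers):
    """updates: list of per-device dicts {layer_index: gradient}.
    Returns one averaged update per layer; layers no device reached get 0.0."""
    agg = []
    for layer in range(num_layers):
        vals = [u[layer] for u in updates if layer in u]
        agg.append(sum(vals) / len(vals) if vals else 0.0)
    return agg

# Device 0 finished all 3 layers; device 1 (a straggler) only delivered the
# last layer, so only layer 2 is averaged over both devices.
global_update = aggregate_layerwise(
    [{0: 1.0, 1: 2.0, 2: 3.0}, {2: 5.0}], num_layers=3)
```

This keeps the synchronous round deadline fixed while still folding straggler contributions into the layers they managed to compute.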
arXiv Detail & Related papers (2024-03-27T09:14:36Z) - Design and Prototyping Distributed CNN Inference Acceleration in Edge Computing [85.74517957717363]
HALP accelerates inference by designing a seamless collaboration among edge devices (EDs) in Edge Computing.
Experiments show that the distributed inference HALP achieves 1.7x inference acceleration for VGG-16.
It is shown that the model selection with distributed inference HALP can significantly improve service reliability.
arXiv Detail & Related papers (2022-11-24T19:48:30Z) - Predictive GAN-powered Multi-Objective Optimization for Hybrid Federated Split Learning [56.125720497163684]
We propose a hybrid federated split learning framework in wireless networks.
We design a parallel computing scheme for model splitting without label sharing, and theoretically analyze the influence of the delayed gradient caused by the scheme on the convergence speed.
arXiv Detail & Related papers (2022-09-02T10:29:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.