Synera: Synergistic LLM Serving across Device and Cloud at Scale
- URL: http://arxiv.org/abs/2511.07423v1
- Date: Fri, 17 Oct 2025 04:31:50 GMT
- Title: Synera: Synergistic LLM Serving across Device and Cloud at Scale
- Authors: Genglin Wang, Liekang Zeng, Bufang Yang, Kaiwei Liu, Guoliang Xing, Chumin Sun, Li Zhou, Jie Sun, Zhenyu Yan,
- Abstract summary: Large Language Models (LLMs) are becoming key components in various mobile operating systems.<n>Their deployment suffers from a set of performance challenges, especially the generation quality and prolonged latency degradation.<n>This paper proposes Synera, a device-cloud synergistic LLM serving system that applies an efficient SLM-LLM synergistic mechanism.
- Score: 8.533983798094683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are becoming key components in various mobile operating systems, driving smart applications like interactive chatbots and personal assistants. While bringing enhanced intelligence to mobile ends, their deployment suffers from a set of performance challenges, especially the generation quality degradation and prolonged latency. Prior works have mainly relied on solutions of cloud offloading or on-device Small Language Models (SLMs). However, the former is usually limited by the communication bottleneck, and the latter sacrifices generation quality due to resource constraints. To mitigate these limitations, this paper proposes Synera, a device-cloud synergistic LLM serving system that applies an efficient SLM-LLM synergistic mechanism. Through empirical studies on LLM's unique computing characteristics, Synera identifies a set of underexplored optimization opportunities in device-cloud synergistic LLM inference, including offloading decisions, pipeline stalls, and batching bottlenecks. To translate them into enhanced performance, Synera introduces tailored designs of communication-efficient selective offloading, stall-free parallel inference, and scalable cloud batching. Extensive evaluations with real-world testbeds show that Synera enables 1.20-5.47x better generation quality against competitive baselines with on-par latency performance. Compared with existing cloud serving, Synera achieves 8.2-16.5% lower cloud serving cost on various benchmarks.
Related papers
- HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network [50.33808558714122]
Large language models' (LLMs) inference at the edge can facilitate prompt service responsiveness while protecting user privacy.<n>We propose HALO, a novel framework that can boost the distributed LLM inference in lossy edge network.<n> Experimental results from a Raspberry Pi cluster demonstrate that HALO achieves a 3.41x end-to-end speedup for LLaMA-series LLMs under unreliable network conditions.
arXiv Detail & Related papers (2026-01-16T07:37:23Z) - Towards On-Device Personalization: Cloud-device Collaborative Data Augmentation for Efficient On-device Language Model [43.13807038270687]
CDCDA-PLM is a framework for deploying personalized on-device language models on user devices with support from a powerful cloud-based LLM.<n>Using both real and synthetic data, A personalized on-device language models (LMs) is fine-tuned via parameter-efficient fine-tuning (PEFT) modules.
arXiv Detail & Related papers (2025-08-29T02:33:13Z) - CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems [62.24576366776727]
We propose a latency-aware scheduling framework to minimize total inference latency.<n>We show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
arXiv Detail & Related papers (2025-08-15T07:49:22Z) - A Novel Hat-Shaped Device-Cloud Collaborative Inference Framework for Large Language Models [12.644230479753476]
Traditional cloud-based large language models (LLMs) meet high-accuracy requirements, but fall short of critical demands for low delay and enhanced privacy.<n>We propose HAT, a novel device-cloud collaborative inference framework that leverages the complementary strengths of U-shaped inference and speculative decoding.<n>We show that HAT achieves promising performance improvements, reducing TTFT by 41% to 54% and TBT by 41% to 77% compared to the baselines.
arXiv Detail & Related papers (2025-03-23T10:54:58Z) - Are We There Yet? A Measurement Study of Efficiency for LLM Applications on Mobile Devices [5.926813659185372]
Small-size large language models (LLMs) can run successfully on powerful mobile devices, though they exhibit quality limitations compared to larger models.<n>Only small-size LLMs can run successfully on powerful mobile devices, though they exhibit quality limitations compared to larger models.
arXiv Detail & Related papers (2025-03-10T16:27:17Z) - CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration [1.6021932740447968]
Large Language Models (LLMs) exhibit remarkable human-like predictive capabilities.<n>It is challenging to deploy LLMs to provide efficient and adaptive inference services at the edge.<n>This paper proposes a novel Cloud-Edge Collaboration framework for LLMs (CE-CoLLM) to tackle these challenges.
arXiv Detail & Related papers (2024-11-05T06:00:27Z) - Cloud-Device Collaborative Learning for Multimodal Large Language Models [24.65882336700547]
We introduce a Cloud-Device Collaborative Continual Adaptation framework to enhance the performance of compressed, device-deployed MLLMs.
Our framework is structured into three key components: a device-to-cloud uplink for efficient data transmission, cloud-based knowledge adaptation, and an optimized cloud-to-device downlink for model deployment.
arXiv Detail & Related papers (2023-12-26T18:46:14Z) - Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes [53.4856038354195]
Pre-trained large language models (LLMs) need fine-tuning to improve their responsiveness to natural language instructions.
FedKSeed employs zeroth-order optimization with a finite set of random seeds.
It significantly reduces transmission requirements between the server and clients to just a few random seeds.
arXiv Detail & Related papers (2023-12-11T13:03:21Z) - DUET: A Tuning-Free Device-Cloud Collaborative Parameters Generation Framework for Efficient Device Model Generalization [66.27399823422665]
Device Model Generalization (DMG) is a practical yet under-investigated research topic for on-device machine learning applications.<n>We propose an efficient Device-cloUd collaborative parametErs generaTion framework DUET.
arXiv Detail & Related papers (2022-09-12T13:26:26Z) - Auto-Split: A General Framework of Collaborative Edge-Cloud AI [49.750972428032355]
This paper describes the techniques and engineering practice behind Auto-Split, an edge-cloud collaborative prototype of Huawei Cloud.
To the best of our knowledge, there is no existing industry product that provides the capability of Deep Neural Network (DNN) splitting.
arXiv Detail & Related papers (2021-08-30T08:03:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.