Related papers: Why Should the Server Do It All?: A Scalable, Versatile, and Model-Agnostic Framework for Server-Light DNN Inference over Massively Distributed Clients via Training-Free Intermediate Feature Compression

URL: http://arxiv.org/abs/2511.11608v1
Date: Mon, 03 Nov 2025 08:44:13 GMT
Title: Why Should the Server Do It All?: A Scalable, Versatile, and Model-Agnostic Framework for Server-Light DNN Inference over Massively Distributed Clients via Training-Free Intermediate Feature Compression
Authors: Mingyu Sung, Suhwan Im, Daeho Bang, Il-Min Kim, Sangseok Yun, Jae-Mo Kang
Abstract summary: We introduce SLICER, a retraining-free, architecture-agnostic framework that compresses IFs to reduce both communication and server load in split computing. Across standard vision and LLM workloads, SLICER reduces uplink volume by up to 10x and server GPU time by up to 4.4x.
Score: 6.932768187544348
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern DNNs often rely on edge-cloud model partitioning (MP), but widely used schemes fix shallow, static split points that underutilize edge compute and concentrate latency and energy on the server. The problem is exacerbated in autoregressive (AR) LLM inference, where per-token forward passes repeatedly generate bulky intermediate features (IFs). We introduce SLICER, a retraining-free, architecture-agnostic framework that compresses IFs to reduce both communication and server load in split computing. SLICER combines (i) asymmetric top-K filtering (ATKF) to sparsify low-magnitude activations, (ii) magnitude-splitting (MS) to group the remaining non-zeros into equal-cardinality blocks, and (iii) adaptive bit quantization (ABQ) that selects per-block bitwidths under a distortion budget. Across standard vision and LLM workloads (e.g., ImageNet/COCO; HellaSwag, PIQA, ARC-E/C, GSM8K, HumanEval), SLICER reduces uplink volume by up to 10x and server GPU time by up to 4.4x, while keeping task quality within ~0-3 pp of baseline. In multi-device settings and AR LLMs, SLICER scales by shifting meaningful compute to the edge and lowering bits-per-token and server time per token, stabilizing per-step traffic. The codec attaches to off-the-shelf models without retraining or architectural changes, offering a plug-and-play path to scalable, low-latency distributed inference. Code is provided in the supplementary material.
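The three stages named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' codec: the function names, the plain (symmetric) top-K rule, the uniform min-max quantizer, the MSE distortion measure, and the candidate bit-widths are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math

def top_k_sparsify(x, k):
    """Keep the k largest-magnitude entries of x and zero the rest.
    Assumption: plain top-K; the paper's 'asymmetric' rule is not modeled."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]

def magnitude_blocks(x, num_blocks):
    """Order non-zero indices by magnitude and cut them into blocks of
    (near-)equal cardinality."""
    idx = sorted((i for i, v in enumerate(x) if v != 0.0), key=lambda i: abs(x[i]))
    if not idx:
        return []
    size = math.ceil(len(idx) / num_blocks)
    return [idx[s:s + size] for s in range(0, len(idx), size)]

def quantize(values, bits):
    """Uniform quantization onto 2**bits levels spanning [min, max]
    (an assumed quantizer, returned in dequantized form)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / step) * step for v in values]

def pick_bits(values, max_mse, candidates=(2, 3, 4, 6, 8)):
    """Return the smallest candidate bit-width whose mean squared error
    stays within max_mse, falling back to the widest candidate."""
    for bits in candidates:
        q = quantize(values, bits)
        mse = sum((a - b) ** 2 for a, b in zip(values, q)) / len(values)
        if mse <= max_mse:
            return bits, q
    return candidates[-1], quantize(values, candidates[-1])

def compress(x, k, num_blocks, max_mse):
    """Sparsify, split by magnitude, then quantize each block separately."""
    sparse = top_k_sparsify(x, k)
    out = [0.0] * len(x)
    bit_plan = []
    for block in magnitude_blocks(sparse, num_blocks):
        bits, q = pick_bits([sparse[i] for i in block], max_mse)
        bit_plan.append(bits)
        for i, v in zip(block, q):
            out[i] = v
    return out, bit_plan

features = [0.1, -2.0, 0.05, 3.0, -0.2, 1.5, 0.0, -0.7]
reconstructed, bit_plan = compress(features, k=4, num_blocks=2, max_mse=1e-3)
```

Grouping values of similar magnitude into the same block narrows each block's range, so a smaller bit-width can meet the same distortion budget. In a real split-computing pipeline the client would transmit the kept indices, the per-block range, and the integer codes rather than the dequantized values returned here.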

Related papers

PLA-Serve: A Prefill-Length-Aware LLM Serving System [33.313531352453346]
PLA-Serve identifies and disaggregates requests with different prompt lengths to reduce TTFT latency. We observe that prompt-length variations lead to distinct bottlenecks, motivating an adaptive scheduling strategy. PLA-Serve reduces prefill latency by over 30% compared to vanilla SGLang under prefill-decode disaggregation.
arXiv Detail & Related papers (2026-01-04T18:14:24Z)
CycleSL: Server-Client Cyclical Update Driven Scalable Split Learning [60.59553507555341]
We introduce CycleSL, a novel aggregation-free split learning framework. Inspired by alternating block coordinate descent, CycleSL treats server-side training as an independent higher-level machine learning task. Our empirical findings highlight the effectiveness of CycleSL in enhancing model performance.
arXiv Detail & Related papers (2025-11-23T21:00:21Z)
Range Asymmetric Numeral Systems-Based Lightweight Intermediate Feature Compression for Split Computing of Deep Neural Networks [5.186026342830856]
Split computing distributes deep neural network inference between resource-constrained edge devices and cloud servers. We propose a novel lightweight compression framework that leverages Range Asymmetric Numeral Systems (rANS) encoding with asymmetric integer quantization and sparse tensor representation to reduce transmission overhead dramatically.
arXiv Detail & Related papers (2025-11-11T12:33:59Z)
Large Kernel Modulation Network for Efficient Image Super-Resolution [5.875680381119361]
Large Kernel Modulation Network (LKMN) is a pure CNN-based model. LKMN has two core components: Enhanced Partial Large Kernel Block (EPLKB) and Cross-Gate Feed-Forward Network (CGFN). LKMN-L achieves a 0.23 dB PSNR improvement over DAT-light on the Manga109 dataset at $\times$4 upscaling while running nearly 4.8$\times$ faster.
arXiv Detail & Related papers (2025-08-16T03:43:14Z)
EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation [84.70637613266835]
EoRA is a fine-tuning-free method that augments compressed Large Language Models with low-rank matrices. EoRA consistently outperforms prior training-free low-rank methods in recovering the accuracy of compressed LLMs.
arXiv Detail & Related papers (2024-10-28T17:59:03Z)
Communication Efficient ConFederated Learning: An Event-Triggered SAGA Approach [67.27031215756121]
Federated learning (FL) is a machine learning paradigm that targets model training without gathering the local data over various data sources. Standard FL, which employs a single server, can only support a limited number of users, leading to degraded learning capability. In this work, we consider a multi-server FL framework, referred to as Confederated Learning (CFL), in order to accommodate a larger number of users.
arXiv Detail & Related papers (2024-02-28T03:27:10Z)
Adaptive Federated Pruning in Hierarchical Wireless Networks [69.6417645730093]
Federated Learning (FL) is a privacy-preserving distributed learning framework where a server aggregates models updated by multiple devices without accessing their private datasets. In this paper, we introduce model pruning for HFL in wireless networks to reduce the neural network scale. We show that our proposed HFL with model pruning achieves learning accuracy similar to that of HFL without model pruning and reduces communication cost by about 50 percent.
arXiv Detail & Related papers (2023-05-15T22:04:49Z)
Efficient Parallel Split Learning over Resource-constrained Wireless Edge Networks [44.37047471448793]
In this paper, we advocate the integration of the edge computing paradigm and parallel split learning (PSL). We propose an innovative PSL framework, namely, efficient parallel split learning (EPSL), to accelerate model training. We show that the proposed EPSL framework significantly decreases the training latency needed to achieve a target accuracy.
arXiv Detail & Related papers (2023-03-26T16:09:48Z)
FrankenSplit: Efficient Neural Feature Compression with Shallow Variational Bottleneck Injection for Mobile Edge Computing [5.815300670677979]
We introduce a novel framework for resource-conscious compression models and extensively evaluate our method in an asymmetric environment. Our method achieves 60% lower bitrate than a state-of-the-art SC method without decreasing accuracy and is up to 16x faster than offloading with existing standards.
arXiv Detail & Related papers (2023-02-21T14:03:22Z)
An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical computation restrict execution efficiency, especially for Internet of Things (IoT) devices. We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations. Based on a latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in wireless networks. We consider the case of deep neural network (DNN) models, which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.