Dedelayed: Deleting remote inference delay via on-device correction
- URL: http://arxiv.org/abs/2510.13714v1
- Date: Wed, 15 Oct 2025 16:13:44 GMT
- Title: Dedelayed: Deleting remote inference delay via on-device correction
- Authors: Dan Jacobellis, Mateen Ulhaq, Fabien Racapé, Hyomin Choi, Neeraja J. Yadwadkar,
- Abstract summary: We introduce Dedelayed, a delay-corrective method that mitigates arbitrary remote inference delays. Our method employs a lightweight local model that processes the current frame and fuses in features that a heavyweight remote model computes from past frames.
- Score: 5.382679710017697
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Remote inference allows lightweight devices to leverage powerful cloud models. However, communication network latency makes predictions stale and unsuitable for real-time tasks. To address this, we introduce Dedelayed, a delay-corrective method that mitigates arbitrary remote inference delays, allowing the local device to produce low-latency outputs in real time. Our method employs a lightweight local model that processes the current frame and fuses in features that a heavyweight remote model computes from past frames. On video from the BDD100K driving dataset, Dedelayed improves semantic segmentation accuracy over the stronger of the local-only and remote-only baselines across all realistic communication network delays beyond 33 ms. Without incurring additional delay, it improves accuracy by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference, for a round-trip delay of 100 ms. The advantage grows under longer delays and higher-motion scenes, as delay-mitigated split inference sustains accuracy more effectively, providing clear advantages for real-time tasks that must remain aligned with the current world state.
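The abstract's core loop can be illustrated with a minimal sketch (not the authors' implementation): the device always runs a lightweight local model on the current frame and fuses in the freshest remote features available, which were necessarily computed from a past frame. `local_model`, `fuse`, and `remote_inbox` are hypothetical placeholders.

```python
# Sketch of a Dedelayed-style serving step, under the assumptions above.

def run_step(frame, local_model, fuse, remote_inbox):
    """remote_inbox maps a past frame's timestamp to features that the
    heavyweight remote model computed for it; entries arrive after a
    round-trip network delay."""
    local_feats = local_model(frame)
    if not remote_inbox:
        return local_feats  # cold start: no remote features yet, local-only output
    freshest = max(remote_inbox)  # use the least-stale remote features
    return fuse(local_feats, remote_inbox[freshest])
```

Because the local model always sees the current frame, the output stays aligned with the current world state even when the remote features lag by an arbitrary delay.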
Related papers
- CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems [62.24576366776727]
We propose a latency-aware scheduling framework to minimize total inference latency. We show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
arXiv Detail & Related papers (2025-08-15T07:49:22Z)
- Adaptive Deadline and Batch Layered Synchronized Federated Learning [66.93447103966439]
Federated learning (FL) enables collaborative model training across distributed edge devices while preserving data privacy, and typically operates in a round-based synchronous manner. We propose ADEL-FL, a novel framework that jointly optimizes per-round deadlines and user-specific batch sizes for layer-wise aggregation.
arXiv Detail & Related papers (2025-05-29T19:59:18Z)
- Faster and Better LLMs via Latency-Aware Test-Time Scaling [47.3923926808606]
Test-Time Scaling (TTS) has proven effective in improving the performance of Large Language Models (LLMs) during inference. Existing research has overlooked the efficiency of TTS from a latency-sensitive perspective. We demonstrate that a compute-optimal TTS does not always result in the lowest latency in scenarios where latency is critical.
arXiv Detail & Related papers (2025-05-26T07:51:30Z)
- CorrDiff: Adaptive Delay-aware Detector with Temporal Cue Inputs for Real-time Object Detection [11.714072240331518]
CorrDiff is designed to tackle the challenge of delays in real-time detection systems. It is able to utilize runtime-estimated temporal cues to predict objects' locations for multiple future frames. It meets the stringent real-time processing requirements on all kinds of devices.
arXiv Detail & Related papers (2025-01-09T10:34:25Z)
- MTD: Multi-Timestep Detector for Delayed Streaming Perception [0.5439020425819]
Streaming perception is a task of reporting the current state of the world, which is used to evaluate the delay and accuracy of autonomous driving systems.
This paper proposes the Multi-Timestep Detector (MTD), an end-to-end detector that uses dynamic routing for multi-branch future prediction.
The proposed method has been evaluated on the Argoverse-HD dataset, and the experimental results show that it has achieved state-of-the-art performance across various delay settings.
arXiv Detail & Related papers (2023-09-13T06:23:58Z)
- Design and Prototyping Distributed CNN Inference Acceleration in Edge Computing [85.74517957717363]
HALP accelerates inference by designing a seamless collaboration among edge devices (EDs) in Edge Computing.
Experiments show that the distributed inference HALP achieves 1.7x inference acceleration for VGG-16.
It is shown that the model selection with distributed inference HALP can significantly improve service reliability.
arXiv Detail & Related papers (2022-11-24T19:48:30Z)
- Service Delay Minimization for Federated Learning over Mobile Devices [36.027677482303076]
Federated learning over mobile devices has fostered numerous intriguing applications/services.
We propose a service delay efficient FL (SDEFL) scheme over mobile devices.
arXiv Detail & Related papers (2022-05-19T21:25:02Z)
- Selective Network Linearization for Efficient Private Inference [49.937470642033155]
We propose a gradient-based algorithm that selectively linearizes ReLUs while maintaining prediction accuracy.
The results demonstrate up to 4.25% more accuracy (iso-ReLU count at 50K) or 2.2× less latency (iso-accuracy at 70%) than the current state of the art.
arXiv Detail & Related papers (2022-02-04T19:00:24Z)
- R-TOD: Real-Time Object Detector with Minimized End-to-End Delay for Autonomous Driving [3.366875318492424]
This paper aims to provide a more comprehensive understanding of the end-to-end delay.
Three optimization methods are implemented: (i) on-demand capture, (ii) zero-slack pipeline, and (iii) contention-free pipeline.
Our experimental results show a 76% reduction in the end-to-end delay of Darknet YOLO v3.
arXiv Detail & Related papers (2020-10-23T01:03:46Z)
- FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization [78.46088089185156]
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible.
Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models.
We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
arXiv Detail & Related papers (2020-10-21T17:05:01Z)
- Towards Streaming Perception [70.68520310095155]
We present an approach that coherently integrates latency and accuracy into a single metric for real-time online perception.
The key insight behind this metric is to jointly evaluate the output of the entire perception stack at every time instant.
We focus on the illustrative tasks of object detection and instance segmentation in urban video streams, and contribute a novel dataset with high-quality and temporally-dense annotations.
arXiv Detail & Related papers (2020-05-21T01:51:35Z)
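The streaming-perception metric above can be sketched in a few lines (a hedged illustration of the idea, not the paper's evaluation code): at every ground-truth instant, score the latest prediction whose computation had finished by that instant, so slow models are charged for their staleness. The `accuracy` argument is a hypothetical per-pair scoring function.

```python
# Sketch of joint latency/accuracy evaluation, under the assumptions above.

def streaming_score(predictions, ground_truth, accuracy):
    """predictions: list of (finish_time, pred), sorted by finish_time.
    ground_truth: list of (query_time, gt). Returns the mean score."""
    total = 0.0
    for t, gt in ground_truth:
        ready = [pred for finish, pred in predictions if finish <= t]
        # no finished prediction yet counts as a miss (score 0)
        total += accuracy(ready[-1], gt) if ready else 0.0
    return total / len(ground_truth)
```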
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.