Parallax: Runtime Parallelization for Operator Fallbacks in Heterogeneous Edge Systems
- URL: http://arxiv.org/abs/2512.11532v1
- Date: Fri, 12 Dec 2025 13:07:00 GMT
- Title: Parallax: Runtime Parallelization for Operator Fallbacks in Heterogeneous Edge Systems
- Authors: Chong Tang, Hao Dai, Jagmohan Chauhan,
- Abstract summary: Existing frameworks handle dynamic control-flow operators and unsupported kernels poorly, leaving CPU cores idle and causing high latency and memory spikes.<n>We introduce Parallax, a framework that accelerates mobile DNN inference without model or custom operator implementations.<n>Parallax achieves up to 46% latency reduction, maintains controlled memory overhead (26.5% on average), and delivers up to 30% energy savings compared with state-of-the-art frameworks.
- Score: 6.9060172287001365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing demand for real-time DNN applications on edge devices necessitates faster inference of increasingly complex models. Although many devices include specialized accelerators (e.g., mobile GPUs), dynamic control-flow operators and unsupported kernels often fall back to CPU execution. Existing frameworks handle these fallbacks poorly, leaving CPU cores idle and causing high latency and memory spikes. We introduce Parallax, a framework that accelerates mobile DNN inference without model refactoring or custom operator implementations. Parallax first partitions the computation DAG to expose parallelism, then employs branch-aware memory management with dedicated arenas and buffer reuse to reduce runtime footprint. An adaptive scheduler executes branches according to device memory constraints, meanwhile, fine-grained subgraph control enables heterogeneous inference of dynamic models. By evaluating on five representative DNNs across three different mobile devices, Parallax achieves up to 46% latency reduction, maintains controlled memory overhead (26.5% on average), and delivers up to 30% energy savings compared with state-of-the-art frameworks, offering improvements aligned with the responsiveness demands of real-time mobile inference.
Related papers
- Hardware Software Optimizations for Fast Model Recovery on Reconfigurable Architectures [4.058950730052848]
We present MERINDA, an FPGA-accelerated MR framework that restructures computation as a streaming dataflow pipeline.<n>On representative MR workloads, MERINDA delivers up to 6.3x fewer cycles than an FPGA-based LTC baseline.
arXiv Detail & Related papers (2025-12-05T19:38:34Z) - Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the ''Three Taxes'' (Bulk Synchronous, Inter- Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework.<n>We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution.<n>We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z) - MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning [91.90342432541138]
Scaling up model size and training data has advanced foundation models for instance-level perception.<n>High computational cost limits adoption on resource-constrained platforms.<n>We introduce a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
arXiv Detail & Related papers (2025-10-16T18:00:00Z) - CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems [62.24576366776727]
We propose a latency-aware scheduling framework to minimize total inference latency.<n>We show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
arXiv Detail & Related papers (2025-08-15T07:49:22Z) - Optimizing Multi-DNN Inference on Mobile Devices through Heterogeneous Processor Co-Execution [39.033040759452504]
Deep Neural Networks (DNNs) are increasingly deployed across diverse industries, driving demand for mobile device support.<n>Existing mobile inference frameworks often rely on a single processor per model, limiting hardware utilization and causing suboptimal performance and energy efficiency.<n>We propose an Advanced Multi-DNN Model Scheduling (ADMS) strategy for optimizing multi-DNN inference on mobile heterogeneous processors.
arXiv Detail & Related papers (2025-03-27T03:03:09Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative
Model Inference with Unstructured Sparsity [12.663030430488922]
We propose Flash-LLM for enabling low-cost and highly-efficient large generative model inference on high-performance Cores.
At SpMM kernel level, Flash-LLM significantly outperforms the state-of-the-art library, i.e., Sputnik and SparTA by an average of 2.9x and 1.5x, respectively.
arXiv Detail & Related papers (2023-09-19T03:20:02Z) - EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense
Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention.
Our multi-scale linear attention achieves the global receptive field and multi-scale learning.
EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
arXiv Detail & Related papers (2022-05-29T20:07:23Z) - MAPLE-Edge: A Runtime Latency Predictor for Edge Devices [80.01591186546793]
We propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware.
Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters.
We also demonstrate that unlike MAPLE which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes.
arXiv Detail & Related papers (2022-04-27T14:00:48Z) - Towards Memory-Efficient Neural Networks via Multi-Level in situ
Generation [10.563649948220371]
Deep neural networks (DNN) have shown superior performance in a variety of tasks.
As they rapidly evolve, their escalating computation and memory demands make it challenging to deploy them on resource-constrained edge devices.
We propose a general and unified framework to trade expensive memory transactions with ultra-fast on-chip computations.
arXiv Detail & Related papers (2021-08-25T18:50:24Z) - PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with
Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.