MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
- URL: http://arxiv.org/abs/2510.25327v3
- Date: Fri, 31 Oct 2025 06:42:58 GMT
- Title: MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
- Authors: Runxi Huang, Mingxuan Yu, Mingyu Tsoi, Xiaomin Ouyang
- Abstract summary: We propose MMEdge, a new on-device multimodal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units. MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
- Score: 1.6572113577265137
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, a new on-device multimodal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy. This pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
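To make the latency argument in the abstract concrete, the following is a minimal sketch of a toy timing model that compares waiting for the complete sensor input against encoding each unit as soon as it arrives. It is not MMEdge's implementation: the function names, the single-encoder assumption, and the per-unit timings are all hypothetical and chosen only for illustration.

```python
# Toy latency model (hypothetical, not taken from the paper).
# Assumes one sensor stream and one encoder that handles one unit at a time.

def sequential_latency(sense_times, encode_times):
    """Sense every unit first, then encode all of them."""
    return sum(sense_times) + sum(encode_times)


def pipelined_latency(sense_times, encode_times):
    """Encode each unit as soon as it is sensed and the encoder is free."""
    sensed_at = 0.0        # time at which the current unit finishes sensing
    encoder_free_at = 0.0  # time at which the encoder finishes its last unit
    for sense, encode in zip(sense_times, encode_times):
        sensed_at += sense
        # Encoding starts once the unit exists AND the encoder is idle.
        encoder_free_at = max(sensed_at, encoder_free_at) + encode
    return encoder_free_at


if __name__ == "__main__":
    sense = [10.0] * 4   # illustrative per-unit sensing time (ms)
    encode = [6.0] * 4   # illustrative per-unit encoding time (ms)
    print("sequential:", sequential_latency(sense, encode))
    print("pipelined: ", pipelined_latency(sense, encode))
```

In this toy setting, encoding is faster than sensing, so only the last unit's encoding time remains on the critical path after sensing ends. When encoding is slower than sensing, the encoder becomes the bottleneck and the saving shrinks.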
Related papers
- Real-Time Inference for Distributed Multimodal Systems under Communication Delay Uncertainty [37.15356899831919]
Connected cyber-physical systems perform inference based on real-time inputs from multiple data streams. We propose a novel neuro-inspired non-blocking inference paradigm that employs adaptive temporal windows of integration. Our framework achieves robust real-time inference with finer-grained control over the accuracy-latency tradeoff.
arXiv Detail & Related papers (2025-11-20T10:48:54Z)
- NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching [64.10695425442164]
We introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.
arXiv Detail & Related papers (2025-10-15T16:25:18Z)
- CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z)
- A Lightweight Group Multiscale Bidirectional Interactive Network for Real-Time Steel Surface Defect Detection [15.140649886958945]
Group Multiscale Bidirectional Interactive (GMBI) modules enhance multiscale feature extraction and interaction. Experiments on SD-Saliency-900 and NRSD-MN datasets demonstrate that GMBINet delivers competitive accuracy with real-time speeds of 1048 FPS on GPU and 16.53 FPS on CPU at 512 resolution.
arXiv Detail & Related papers (2025-08-22T13:58:35Z)
- FindRec: Stein-Guided Entropic Flow for Multi-Modal Sequential Recommendation [57.577843653775]
We propose FindRec (Flexible unified information disentanglement for multi-modal sequential Recommendation). A Stein kernel-based Integrated Information Coordination Module (IICM) theoretically guarantees distribution consistency between multimodal features and ID streams. A cross-modal expert routing mechanism adaptively filters and combines multimodal features based on their contextual relevance.
arXiv Detail & Related papers (2025-07-07T04:09:45Z)
- Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs). It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
- AdaFlow: Opportunistic Inference on Asynchronous Mobile Data with Generalized Affinity Control [16.944584145880793]
AdaFlow pioneers the formulation of structured cross-modality affinity in mobile contexts using a hierarchical analysis-based normalized matrix.
AdaFlow significantly reduces inference latency by up to 79.9% and enhances accuracy by up to 61.9%, outperforming status quo approaches.
arXiv Detail & Related papers (2024-10-31T15:28:22Z)
- Towards A Flexible Accuracy-Oriented Deep Learning Module Inference Latency Prediction Framework for Adaptive Optimization Algorithms [0.49157446832511503]
This paper presents a deep learning module inference latency prediction framework.
It hosts a set of customizable input parameters to train multiple different RMs per DNN module.
It automatically selects a set of trained RMs leading to the highest possible overall prediction accuracy.
arXiv Detail & Related papers (2023-12-11T15:15:48Z)
- A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
- Combining Multi-Objective Bayesian Optimization with Reinforcement Learning for TinyML [4.2019872499238256]
We propose a novel strategy for deploying deep neural networks on microcontrollers (TinyML) based on multi-objective Bayesian optimization (MOBOpt). Our methodology aims at efficiently finding tradeoffs between a DNN's predictive accuracy, memory requirements on a given target system, and computational complexity.
arXiv Detail & Related papers (2023-05-23T14:31:52Z)
- Dynamic Multimodal Fusion [8.530680502975095]
Dynamic multimodal fusion (DynMM) is a new approach that adaptively fuses multimodal data and generates data-dependent forward paths during inference.
Results on various multimodal tasks demonstrate the efficiency and wide applicability of our approach.
arXiv Detail & Related papers (2022-03-31T21:35:13Z)
- Adaptive Anomaly Detection for Internet of Things in Hierarchical Edge Computing: A Contextual-Bandit Approach [81.5261621619557]
We propose an adaptive anomaly detection scheme with hierarchical edge computing (HEC).
We first construct multiple anomaly detection DNN models with increasing complexity, and associate each of them to a corresponding HEC layer.
Then, we design an adaptive model selection scheme that is formulated as a contextual-bandit problem and solved by using a reinforcement learning policy network.
arXiv Detail & Related papers (2021-08-09T08:45:47Z)
- TimeAutoML: Autonomous Representation Learning for Multivariate Irregularly Sampled Time Series [27.0506649441212]
We propose an autonomous representation learning approach for multivariate time series (TimeAutoML) with irregular sampling rates and variable lengths.
Extensive empirical studies on real-world datasets demonstrate that the proposed TimeAutoML outperforms competing approaches on various tasks by a large margin.
arXiv Detail & Related papers (2020-10-04T15:01:46Z)