Shared Mobile-Cloud Inference for Collaborative Intelligence
- URL: http://arxiv.org/abs/2002.00157v1
- Date: Sat, 1 Feb 2020 07:12:01 GMT
- Title: Shared Mobile-Cloud Inference for Collaborative Intelligence
- Authors: Mateen Ulhaq and Ivan V. Bajić
- Abstract summary: We present a shared mobile-cloud approach to neural model inference.
The strategy can improve inference latency, energy consumption, and network bandwidth usage.
Further performance gain can be achieved by compressing the feature tensor before its transmission.
- Score: 35.103437828235826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As AI applications for mobile devices become more prevalent, there is an
increasing need for faster execution and lower energy consumption for neural
model inference. Historically, the models run on mobile devices have been
smaller and simpler in comparison to large state-of-the-art research models,
which can only run on the cloud. However, cloud-only inference has drawbacks
such as increased network bandwidth consumption and higher latency. In
addition, cloud-only inference requires the input data (images, audio) to be
fully transferred to the cloud, creating concerns about potential privacy
breaches. We demonstrate an alternative approach: shared mobile-cloud
inference. Partial inference is performed on the mobile in order to reduce the
dimensionality of the input data and arrive at a compact feature tensor, which
is a latent space representation of the input signal. The feature tensor is
then transmitted to the server for further inference. This strategy can improve
inference latency, energy consumption, and network bandwidth usage, as well as
provide privacy protection, because the original signal never leaves the
mobile. Further performance gain can be achieved by compressing the feature
tensor before its transmission.
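Below is a minimal sketch of the split-inference idea described in the abstract, written in PyTorch. The ResNet-50 backbone, the choice of split point, and the 8-bit quantizer standing in for feature-tensor compression are illustrative assumptions, not the authors' actual model or codec.
```python
# A minimal sketch, not the authors' exact pipeline: the ResNet-50 backbone,
# the split point, and the 8-bit quantizer standing in for feature-tensor
# compression are all illustrative assumptions.
import io
import torch
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(weights=None).eval()

# Mobile-side sub-model: runs partial inference up to an intermediate layer.
mobile_part = nn.Sequential(
    model.conv1, model.bn1, model.relu, model.maxpool,
    model.layer1, model.layer2, model.layer3,
)
# Cloud-side sub-model: completes inference from the received feature tensor.
cloud_part = nn.Sequential(
    model.layer4, model.avgpool, nn.Flatten(1), model.fc,
)

def mobile_inference(image: torch.Tensor) -> bytes:
    """Compute the latent feature tensor on-device and return a compressed payload."""
    with torch.no_grad():
        features = mobile_part(image)
    # Crude stand-in for feature compression: per-tensor 8-bit quantization.
    scale = features.abs().max().clamp(min=1e-8) / 127.0
    quantized = torch.round(features / scale).to(torch.int8)
    buffer = io.BytesIO()
    torch.save({"q": quantized, "scale": scale}, buffer)
    return buffer.getvalue()            # only this payload leaves the device

def cloud_inference(payload: bytes) -> torch.Tensor:
    """Dequantize the received features on the server and finish inference."""
    blob = torch.load(io.BytesIO(payload))
    features = blob["q"].to(torch.float32) * blob["scale"]
    with torch.no_grad():
        return cloud_part(features)     # class logits

logits = cloud_inference(mobile_inference(torch.randn(1, 3, 224, 224)))
```
Moving the split point deeper shifts more compute onto the device but typically shrinks the transmitted tensor; a real deployment would also entropy-code the quantized features rather than rely on plain quantization.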
Related papers
- Knowledge boosting during low-latency inference [20.617827647115874]
Models for low-latency, streaming applications could benefit from the knowledge capacity of larger models, but edge devices cannot run these models due to resource constraints.
We propose knowledge boosting, a novel technique that allows a large model to operate on time-delayed input during inference, while still boosting small model performance.
Our results show larger gains where the performance gap between the small and large models is wide, demonstrating a promising method for large-small model collaboration for low-latency applications.
arXiv Detail & Related papers (2024-07-09T22:04:23Z)
- Combining Cloud and Mobile Computing for Machine Learning [2.595189746033637]
We consider model segmentation as a solution to improve the user experience.
We show that the division not only reduces the wait time for users but can also be fine-tuned to optimize the workloads of the cloud.
arXiv Detail & Related papers (2024-01-20T06:14:22Z)
- Mobile-Cloud Inference for Collaborative Intelligence [3.04585143845864]
There is an increasing need for faster execution and lower energy consumption for deep learning model inference.
Historically, the models run on mobile devices have been smaller and simpler in comparison to large state-of-the-art research models, which can only run on the cloud.
Cloud-only inference has drawbacks such as increased network bandwidth consumption and higher latency.
There is an alternative approach: shared mobile-cloud inference.
arXiv Detail & Related papers (2023-06-24T14:22:53Z)
- Real-Time Image Demoireing on Mobile Devices [59.59997851375429]
We propose a dynamic demoireing acceleration method (DDA) for real-time deployment on mobile devices.
Our motivation stems from the simple yet universal fact that moiré patterns are often unevenly distributed across an image.
Our method drastically reduces inference time, enabling real-time image demoireing on mobile devices.
arXiv Detail & Related papers (2023-02-04T15:42:42Z)
- PriMask: Cascadable and Collusion-Resilient Data Masking for Mobile Cloud Inference [8.699639153183723]
A mobile device uses a secret small-scale neural network called MaskNet to mask the data before transmission.
PriMask significantly weakens the cloud's capability to recover the data or extract certain private attributes.
We apply PriMask to three mobile sensing applications with diverse modalities and complexities.
arXiv Detail & Related papers (2022-11-12T17:54:13Z)
- Over-the-Air Federated Learning with Privacy Protection via Correlated Additive Perturbations [57.20885629270732]
We consider privacy aspects of wireless federated learning with Over-the-Air (OtA) transmission of gradient updates from multiple users/agents to an edge server.
Traditional perturbation-based methods provide privacy protection while sacrificing the training accuracy.
In this work, we aim at minimizing privacy leakage to the adversary and the degradation of model accuracy at the edge server.
arXiv Detail & Related papers (2022-10-05T13:13:35Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- THE-X: Privacy-Preserving Transformer Inference with Homomorphic Encryption [112.02441503951297]
Privacy-preserving inference of transformer models is in demand among cloud service users.
We introduce THE-X, an approximation approach for transformers, which enables privacy-preserving inference of pre-trained models.
arXiv Detail & Related papers (2022-06-01T03:49:18Z)
- Auto-Split: A General Framework of Collaborative Edge-Cloud AI [49.750972428032355]
This paper describes the techniques and engineering practice behind Auto-Split, an edge-cloud collaborative prototype of Huawei Cloud.
To the best of our knowledge, there is no existing industry product that provides the capability of Deep Neural Network (DNN) splitting.
arXiv Detail & Related papers (2021-08-30T08:03:29Z)
- Runtime Deep Model Multiplexing for Reduced Latency and Energy Consumption Inference [6.896677899938492]
We propose a learning algorithm to design a light-weight neural multiplexer that calls the model that will consume the minimum compute resources for a successful inference.
Mobile devices can use the proposed algorithm to offload the hard inputs to the cloud while inferring the easy ones locally.
arXiv Detail & Related papers (2020-01-14T23:49:51Z)
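The multiplexing idea above lends itself to a short sketch of the routing logic. The gating network, the fixed threshold, and the locally simulated "cloud" model below are illustrative assumptions; a real system would train the multiplexer and reach the large model over the network.
```python
# A minimal sketch of input-dependent model multiplexing; the gating network,
# threshold, and stand-in "cloud" model are illustrative assumptions.
import torch
import torch.nn as nn

local_model = nn.Sequential(nn.Flatten(1), nn.Linear(3 * 32 * 32, 10))
cloud_model = nn.Sequential(                      # stands in for a large remote model;
    nn.Flatten(1), nn.Linear(3 * 32 * 32, 256),   # a real system would reach it over
    nn.ReLU(), nn.Linear(256, 10),                # the network, not call it locally
)
# Lightweight multiplexer: scores how likely the local model is to suffice.
multiplexer = nn.Sequential(nn.Flatten(1), nn.Linear(3 * 32 * 32, 1), nn.Sigmoid())

def run_inference(x: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    with torch.no_grad():
        easy_score = multiplexer(x).item()        # learned "easiness" estimate
        if easy_score >= threshold:
            return local_model(x)                 # easy input: infer on-device
        return cloud_model(x)                     # hard input: offload to the cloud

logits = run_inference(torch.randn(1, 3, 32, 32))
```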
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.