Token Level Routing Inference System for Edge Devices
- URL: http://arxiv.org/abs/2504.07878v1
- Date: Thu, 10 Apr 2025 15:54:19 GMT
- Title: Token Level Routing Inference System for Edge Devices
- Authors: Jianshu She, Wenhao Zheng, Zhengzhong Liu, Hongyi Wang, Eric Xing, Huaxiu Yao, Qirong Ho
- Abstract summary: We present a novel collaborative decoding inference system that allows small models to perform on-device inference while selectively consulting a cloud-based large model for critical token generation. Remarkably, the system achieves a 60% performance gain on CommonsenseQA using only a 0.5B model on an M1 MacBook, with fewer than 7% of generated tokens uploaded to the large model in the cloud.
- Score: 21.721914273034972
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The computational complexity of large language model (LLM) inference significantly constrains their deployment efficiency on edge devices. In contrast, small language models offer faster decoding and lower resource consumption but often suffer from degraded response quality and heightened susceptibility to hallucinations. To address this trade-off, collaborative decoding, in which a large model assists in generating critical tokens, has emerged as a promising solution. This paradigm leverages the strengths of both model types by enabling high-quality inference through selective intervention of the large model, while maintaining the speed and efficiency of the smaller model. In this work, we present a novel collaborative decoding inference system that allows small models to perform on-device inference while selectively consulting a cloud-based large model for critical token generation. Remarkably, the system achieves a 60% performance gain on CommonsenseQA using only a 0.5B model on an M1 MacBook, with fewer than 7% of generated tokens uploaded to the large model in the cloud.
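The abstract's routing rule can be read as a per-token confidence gate: the small model decodes on device, and only tokens it is uncertain about are requested from the cloud model. The sketch below illustrates this idea under stated assumptions; the Qwen checkpoints, the 0.7 confidence threshold, greedy decoding, and the in-process "large" model standing in for a cloud endpoint are all illustrative choices, not the paper's implementation.

```python
# Minimal sketch of confidence-thresholded token routing (illustrative, not the
# paper's system). Both checkpoints are assumed to share a tokenizer; in the real
# system the large model would sit behind a cloud RPC rather than in-process.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

SMALL_ID = "Qwen/Qwen2.5-0.5B-Instruct"   # on-device model (assumed checkpoint)
LARGE_ID = "Qwen/Qwen2.5-7B-Instruct"     # stand-in for the cloud model (assumed)

tok = AutoTokenizer.from_pretrained(SMALL_ID)
small = AutoModelForCausalLM.from_pretrained(SMALL_ID)
large = AutoModelForCausalLM.from_pretrained(LARGE_ID)  # placeholder for a remote call

@torch.no_grad()
def generate(prompt: str, max_new_tokens: int = 64, threshold: float = 0.7) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    uploaded = 0
    for _ in range(max_new_tokens):
        # Small model proposes a next-token distribution.
        probs = small(ids).logits[:, -1, :].softmax(dim=-1)
        conf, next_id = probs.max(dim=-1)
        if conf.item() < threshold:
            # Low confidence: treat the position as "critical" and defer to the
            # large model (in deployment, an upload of the current context).
            next_id = large(ids).logits[:, -1, :].argmax(dim=-1)
            uploaded += 1
        ids = torch.cat([ids, next_id.unsqueeze(-1)], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    print(f"{uploaded} token(s) routed to the large model")
    return tok.decode(ids[0], skip_special_tokens=True)
```

In the deployed system the large-model call would be a network request carrying the current context, so only the low-confidence positions (under 7% of tokens in the reported CommonsenseQA run) ever leave the device.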
Related papers
- FuXi-$\alpha$: Scaling Recommendation Model with Feature Interaction Enhanced Transformer [81.12174905444229]
Recent advancements have shown that expanding sequential recommendation models to large-scale recommendation models can be an effective strategy.
We propose a new model called FuXi-$\alpha$ to address these issues.
Our model outperforms existing models, with its performance continuously improving as the model size increases.
arXiv Detail & Related papers (2025-02-05T09:46:54Z) - Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient [52.96232442322824]
Collaborative Decoding (CoDe) is a novel efficient decoding strategy tailored for the Visual Auto-Regressive (VAR) framework.
CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales.
CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98.
arXiv Detail & Related papers (2024-11-26T15:13:15Z) - Dual-Model Distillation for Efficient Action Classification with Hybrid Edge-Cloud Solution [1.8029479474051309]
We design a hybrid edge-cloud solution that leverages the efficiency of smaller models for local processing while deferring to larger, more accurate cloud-based models when necessary.
Specifically, we propose a novel unsupervised data generation method, Dual-Model Distillation (DMD), to train a lightweight switcher model that can predict when the edge model's output is uncertain (a generic sketch of this deferral pattern appears after the related papers list below).
Experimental results on the action classification task show that our framework not only requires less computational overhead, but also improves accuracy compared to using a large model alone.
arXiv Detail & Related papers (2024-10-16T02:06:27Z) - Knowledge boosting during low-latency inference [20.617827647115874]
Models for low-latency, streaming applications could benefit from the knowledge capacity of larger models, but edge devices cannot run these models due to resource constraints.
We propose knowledge boosting, a novel technique that allows a large model to operate on time-delayed input during inference, while still boosting small model performance.
Our results show larger gains where the performance gap between the small and large models is wide, demonstrating a promising method for large-small model collaboration for low-latency applications.
arXiv Detail & Related papers (2024-07-09T22:04:23Z) - DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception [66.88792390480343]
We propose DEEM, a simple but effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder. DEEM exhibits enhanced robustness and a superior capacity to alleviate model hallucinations while utilizing fewer trainable parameters, less pre-training data, and a smaller base model size.
arXiv Detail & Related papers (2024-05-24T05:46:04Z) - Advancing the Robustness of Large Language Models through Self-Denoised Smoothing [50.54276872204319]
Large language models (LLMs) have achieved significant success, but their vulnerability to adversarial perturbations has raised considerable concerns.
We propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions.
Unlike previous denoised smoothing techniques in computer vision, which require training a separate model to enhance the robustness of LLMs, our method offers significantly better efficiency and flexibility.
arXiv Detail & Related papers (2024-04-18T15:47:00Z) - Towards Robust and Efficient Cloud-Edge Elastic Model Adaptation via Selective Entropy Distillation [56.79064699832383]
We establish a Cloud-Edge Elastic Model Adaptation (CEMA) paradigm in which the edge models only need to perform forward propagation.
In our CEMA, to reduce the communication burden, we devise two criteria to exclude unnecessary samples from uploading to the cloud.
arXiv Detail & Related papers (2024-02-27T08:47:19Z) - Big model only for hard audios: Sample dependent Whisper model selection for efficient inferences [7.592727209806414]
Several ASR models exist in various sizes, with different inference costs leading to different performance levels.
We propose to train a decision module that, given an audio sample, selects the smallest model sufficient to produce a good transcription.
By keeping the decision process computationally efficient, we build a decision module that allows substantial computational savings with reduced performance drops.
arXiv Detail & Related papers (2023-09-22T08:50:58Z) - A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes [54.83802872236367]
We propose a dynamic cascaded encoder Automatic Speech Recognition (ASR) model, which unifies models for different deployment scenarios.
The proposed large-medium model has 30% smaller size and reduces power consumption by 33%, compared to the baseline cascaded encoder model.
The triple-size model that unifies the large, medium, and small models achieves 37% total size reduction with minimal quality loss.
arXiv Detail & Related papers (2022-04-13T04:15:51Z) - Real-time Human Detection Model for Edge Devices [0.0]
Convolutional Neural Networks (CNNs) have replaced traditional feature extraction and machine learning models in detection and classification tasks.
Lightweight CNN models have been recently introduced for real-time tasks.
This paper suggests a CNN-based lightweight model that can fit on a limited edge device such as Raspberry Pi.
arXiv Detail & Related papers (2021-11-20T18:42:17Z) - MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks thanks to their large parameter capacity, but that capacity also incurs a huge computation cost.
We explore accelerating large-model inference through conditional computation, based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version of equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z) - Communication-Computation Efficient Device-Edge Co-Inference via AutoML [4.06604174802643]
Device-edge co-inference partitions a deep neural network between a resource-constrained mobile device and an edge server.
On-device model sparsity level and intermediate feature compression ratio have direct impacts on workload and communication overhead.
We propose a novel automated machine learning (AutoML) framework based on deep reinforcement learning (DRL).
arXiv Detail & Related papers (2021-08-30T06:36:30Z)
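A pattern that recurs across several of the related papers above (the Dual-Model Distillation switcher, the sample-dependent Whisper model selection, and CEMA's upload criteria) is a lightweight deferral gate: a cheap module inspects the edge model's output and decides, per sample, whether to escalate to a larger cloud model. The sketch below shows only that generic pattern; DeferralGate, feature_fn, and the 0.5 threshold are hypothetical, and the code is not the implementation of any cited paper.

```python
# Generic edge-cloud deferral pattern (illustrative only). The gate is a tiny
# classifier over features of the edge model's output; it is trained separately
# (e.g., on labels indicating where the edge and cloud models disagree).
import torch
import torch.nn as nn

class DeferralGate(nn.Module):
    """Binary classifier over edge-model features: output near 1 means escalate."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(feats)).squeeze(-1)

def route(x, edge_model, cloud_model, gate, feature_fn, tau: float = 0.5):
    """Run the edge model first; call the cloud model only when the gate flags the sample."""
    edge_out = edge_model(x)             # cheap on-device prediction
    feats = feature_fn(x, edge_out)      # e.g., softmax entropy or pooled hidden states (assumed)
    if gate(feats).item() > tau:         # gate predicts the edge output is unreliable
        return cloud_model(x)            # expensive path, used sparingly
    return edge_out
```

The design goal these papers share is that the gate itself must be far cheaper than the models it arbitrates between, so the savings from skipping the large model are not spent on the routing decision itself.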