Related papers: WDMoE: Wireless Distributed Mixture of Experts for Large Language Models

WDMoE: Wireless Distributed Mixture of Experts for Large Language Models

URL: http://arxiv.org/abs/2411.06681v1
Date: Mon, 11 Nov 2024 02:48:00 GMT
Title: WDMoE: Wireless Distributed Mixture of Experts for Large Language Models
Authors: Nan Xue, Yaping Sun, Zhiyong Chen, Meixia Tao, Xiaodong Xu, Liang Qian, Shuguang Cui, Wenjun Zhang, Ping Zhang,
Abstract summary: Large Language Models (LLMs) have achieved significant success in various natural language processing tasks. We propose a wireless distributed Mixture of Experts (WDMoE) architecture to enable collaborative deployment of LLMs across edge servers at the base station (BS) and mobile devices in wireless networks.
Score: 68.45482959423323
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have achieved significant success in various natural language processing tasks, but the role of wireless networks in supporting LLMs has not been thoroughly explored. In this paper, we propose a wireless distributed Mixture of Experts (WDMoE) architecture to enable collaborative deployment of LLMs across edge servers at the base station (BS) and mobile devices in wireless networks. Specifically, we decompose the MoE layer in LLMs by placing the gating network and the preceding neural network layer at BS, while distributing the expert networks among the devices. This deployment leverages the parallel inference capabilities of expert networks on mobile devices, effectively utilizing the limited computing and caching resources of these devices. Accordingly, we develop a performance metric for WDMoE-based LLMs, which accounts for both model capability and latency. To minimize the latency while maintaining accuracy, we jointly optimize expert selection and bandwidth allocation based on the performance metric. Moreover, we build a hardware testbed using NVIDIA Jetson kits to validate the effectiveness of WDMoE. Both theoretical simulations and practical hardware experiments demonstrate that the proposed method can significantly reduce the latency without compromising LLM performance.

Related papers

Large Language Model (LLM)-enabled Reinforcement Learning for Wireless Network Optimization [79.27012080083603]
Large language models (LLMs) offer promising tools to enhance reinforcement learning in wireless networks.<n>We propose an LLM-assisted state representation and semantic extraction to enhance the multi-agent reinforcement learning framework.
arXiv Detail & Related papers (2026-01-15T01:42:39Z)
Are We There Yet? A Measurement Study of Efficiency for LLM Applications on Mobile Devices [5.926813659185372]
Small-size large language models (LLMs) can run successfully on powerful mobile devices, though they exhibit quality limitations compared to larger models. Only small-size LLMs can run successfully on powerful mobile devices, though they exhibit quality limitations compared to larger models.
arXiv Detail & Related papers (2025-03-10T16:27:17Z)
NetOrchLLM: Mastering Wireless Network Orchestration with Large Language Models [11.015852090523229]
Large language models (LLMs) have revolutionized various domains by leveraging their sophisticated natural language understanding capabilities. This paper presents NETORCHLLM, a wireless NETwork ORCHestrator LLM framework that seamlessly orchestrates diverse wireless-specific models. A comprehensive framework is introduced, demonstrating the practical viability of our approach.
arXiv Detail & Related papers (2024-12-13T12:48:15Z)
Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models. Our approach employs activation sparsity to extract experts. Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration [10.970637831760136]
Speculative decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs) We introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference. We show that SWIFT can achieve over a 1.3x-1.6x speedup while preserving the original distribution of the generated text.
arXiv Detail & Related papers (2024-10-09T14:15:30Z)
R-SFLLM: Jamming Resilient Framework for Split Federated Learning with Large Language Models [83.77114091471822]
Split federated learning (SFL) is a compute-efficient paradigm in distributed machine learning (ML) A challenge in SFL, particularly when deployed over wireless channels, is the susceptibility of transmitted model parameters to adversarial jamming. This is particularly pronounced for word embedding parameters in large language models (LLMs), which are crucial for language understanding. A physical layer framework is developed for resilient SFL with LLMs (R-SFLLM) over wireless networks.
arXiv Detail & Related papers (2024-07-16T12:21:29Z)
Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capacities on various tasks, and integrating the capacities of LLMs into the Internet of Things (IoT) applications has drawn much research attention recently. Due to security concerns, many institutions avoid accessing state-of-the-art commercial LLM services, requiring the deployment and utilization of open-source LLMs in a local network setting. We propose a LLM-based Generative IoT (GIoT) system deployed in the local network setting in this study.
arXiv Detail & Related papers (2024-06-14T19:24:00Z)
WDMoE: Wireless Distributed Large Language Models with Mixture of Experts [65.57581050707738]
We propose a wireless distributed Large Language Models (LLMs) paradigm based on Mixture of Experts (MoE) We decompose the MoE layer in LLMs by deploying the gating network and the preceding neural network layer at base station (BS) and mobile devices. We design an expert selection policy by taking into account both the performance of the model and the end-to-end latency.
arXiv Detail & Related papers (2024-05-06T02:55:50Z)
NetLLM: Adapting Large Language Models for Networking [36.61572542761661]
We present NetLLM, the first framework that provides a coherent design to harness the powerful capabilities of LLMs with low efforts to solve networking problems. Specifically, NetLLM empowers the LLM to effectively process multimodal data in networking and efficiently generate task-specific answers.
arXiv Detail & Related papers (2024-02-04T04:21:34Z)
Device Sampling and Resource Optimization for Federated Learning in Cooperative Edge Networks [17.637761046608]
Federated learning (FedL) distributes machine learning (ML) across worker devices by having them train local models that are periodically aggregated by a server. FedL ignores two important characteristics of contemporary wireless networks: (i) the network may contain heterogeneous communication/computation resources, and (ii) there may be significant overlaps in devices' local data distributions. We develop a novel optimization methodology that jointly accounts for these factors via intelligent device sampling complemented by device-to-device (D2D) offloading.
arXiv Detail & Related papers (2023-11-07T21:17:59Z)
Computational Intelligence and Deep Learning for Next-Generation Edge-Enabled Industrial IoT [51.68933585002123]
We investigate how to deploy computational intelligence and deep learning (DL) in edge-enabled industrial IoT networks. In this paper, we propose a novel multi-exit-based federated edge learning (ME-FEEL) framework. In particular, the proposed ME-FEEL can achieve an accuracy gain up to 32.7% in the industrial IoT networks with the severely limited resources.
arXiv Detail & Related papers (2021-10-28T08:14:57Z)
Device Sampling for Heterogeneous Federated Learning: Theory, Algorithms, and Implementation [24.084053136210027]
We develop a sampling methodology based on graph sequential convolutional networks (GCNs) We find that our methodology while sampling less than 5% of all devices outperforms conventional federated learning (FedL) substantially both in terms of trained model accuracy and required resource utilization.
arXiv Detail & Related papers (2021-01-04T05:59:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.