Related papers: MoE$^2$: Optimizing Collaborative Inference for Edge Large Language Models

URL: http://arxiv.org/abs/2501.09410v1
Date: Thu, 16 Jan 2025 09:36:32 GMT
Title: MoE$^2$: Optimizing Collaborative Inference for Edge Large Language Models
Authors: Lyudong Jin, Yanning Zhang, Yanhan Li, Shurong Wang, Howard H. Yang, Jian Wu, Meng Zhang
Abstract summary: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. We introduce Mixture-of-Edge-Experts (MoE$^2$), a novel collaborative inference framework for edge LLMs.
Score: 43.83407446438587
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. Exploiting the heterogeneous capabilities of edge LLMs is crucial for diverse emerging applications, as it enables greater cost-effectiveness and reduced latency. In this work, we introduce \textit{Mixture-of-Edge-Experts (MoE$^2$)}, a novel collaborative inference framework for edge LLMs. We formulate the joint gating and expert selection problem to optimize inference performance under energy and latency constraints. Unlike conventional MoE problems, LLM expert selection is significantly more challenging due to its combinatorial nature and the heterogeneity of edge LLMs across various attributes. To this end, we propose a two-level expert selection mechanism through which we uncover an optimality-preserving property of gating parameters across expert selections. This property enables the decomposition of the training and selection processes, significantly reducing complexity. Furthermore, we leverage the objective's monotonicity and design a discrete monotonic optimization algorithm for optimal expert selection. We implement edge servers with NVIDIA Jetson AGX Orins and NVIDIA RTX 4090 GPUs, and perform extensive experiments. Our results validate the performance improvements of various LLM models and show that our MoE$^2$ method can achieve optimal trade-offs among different delay and energy budgets, and outperforms baselines under various system resource constraints.
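
Illustration (not from the paper): the abstract describes choosing a subset of edge LLM experts under energy and latency budgets. The Python sketch below shows the shape of that selection problem using plain exhaustive search, not the paper's discrete monotonic optimization algorithm. The `Expert` fields, the device names, and all numbers are hypothetical, and the sketch assumes that gains and energy costs add up and that experts run in parallel, so the latency of a subset is that of its slowest member.

```python
from itertools import combinations
from typing import NamedTuple


class Expert(NamedTuple):
    name: str
    gain: float      # hypothetical contribution to answer quality
    energy: float    # hypothetical energy cost per query
    latency: float   # hypothetical response delay per query


def select_experts(experts, energy_budget, latency_budget):
    """Exhaustively pick the feasible subset with the highest total gain.

    Assumptions (not from the paper): gains add up, energy costs add up,
    and experts run in parallel, so subset latency is the slowest member.
    """
    best_subset, best_gain = (), 0.0
    for k in range(1, len(experts) + 1):
        for subset in combinations(experts, k):
            energy = sum(e.energy for e in subset)
            latency = max(e.latency for e in subset)
            if energy > energy_budget or latency > latency_budget:
                continue  # this subset violates a budget
            gain = sum(e.gain for e in subset)
            if gain > best_gain:
                best_subset, best_gain = subset, gain
    return [e.name for e in best_subset], best_gain


# Hypothetical experts; names and numbers are made up for illustration.
experts = [
    Expert("orin-small", gain=0.30, energy=2.0, latency=0.8),
    Expert("orin-medium", gain=0.45, energy=4.0, latency=1.5),
    Expert("rtx-large", gain=0.70, energy=9.0, latency=1.0),
]
chosen, gain = select_experts(experts, energy_budget=12.0, latency_budget=1.2)
print(chosen, gain)
```

Exhaustive search visits every non-empty subset, so its cost grows exponentially with the number of experts. That growth is the combinatorial difficulty the abstract refers to, and it is why the authors exploit the monotonicity of the objective instead.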

Related papers

LLMize: A Framework for Large Language Model-Based Numerical Optimization [0.0]
Large language models (LLMs) have recently shown strong reasoning capabilities beyond traditional language tasks. This paper presents LLMize, an open-source Python framework that enables LLM-driven optimization.
arXiv Detail & Related papers (2025-12-30T20:05:30Z)
EdgeReasoning: Characterizing Reasoning LLM Deployment on Edge GPUs [0.36050743818632486]
Large language models (LLMs) for reasoning tasks on edge GPUs face critical challenges from strict latency constraints and limited computational resources. To navigate these constraints, developers must balance reasoning versus non-reasoning architectures, selecting appropriate model sizes, allocating token budgets, and applying test-time scaling strategies. We present EdgeReasoning, a comprehensive study characterizing the deployment of reasoning LLMs on edge GPUs.
arXiv Detail & Related papers (2025-10-21T04:18:25Z)
LLM4CMO: Large Language Model-aided Algorithm Design for Constrained Multiobjective Optimization [54.35609820607923]
Large language models (LLMs) offer new opportunities for assisting with algorithm design. We propose LLM4CMO, a novel CMOEA based on a dual-population, two-stage framework. LLMs can serve as efficient co-designers in the development of complex evolutionary optimization algorithms.
arXiv Detail & Related papers (2025-08-16T02:00:57Z)
HeurAgenix: Leveraging LLMs for Solving Complex Combinatorial Optimization Challenges [10.088078143772563]
Heuristic algorithms play a vital role in solving combinatorial optimization (CO) problems. HeurAgenix is a two-stage hyper-heuristic framework powered by large language models (LLMs).
arXiv Detail & Related papers (2025-06-18T07:20:01Z)
Enabling Flexible Multi-LLM Integration for Scalable Knowledge Aggregation [45.72492804683268]
Large language models (LLMs) have shown remarkable promise but remain challenging to continually improve through traditional finetuning. We propose a framework that adaptively selects and aggregates knowledge from diverse LLMs to build a single, stronger model.
arXiv Detail & Related papers (2025-05-28T16:24:50Z)
Efficient Multi-modal Long Context Learning for Training-free Adaptation [96.21248144937627]
This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC). It embeds demonstration examples directly into the model input. It condenses long-context multimodal inputs into compact, task-specific memory representations.
arXiv Detail & Related papers (2025-05-26T10:49:44Z)
Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning [76.10639521319382]
We propose Symbolic-MoE, a symbolic, text-based, and gradient-free Mixture-of-Experts framework. We show that Symbolic-MoE's instance-level expert selection improves performance by a large margin but -- when implemented naively -- can introduce a high computational overhead.
arXiv Detail & Related papers (2025-03-07T18:03:13Z)
Improving Existing Optimization Algorithms with LLMs [0.9668407688201361]
This paper investigates how Large Language Models (LLMs) can enhance existing optimization algorithms. Using their pre-trained knowledge, we demonstrate their ability to propose innovative variations and implementation strategies. Our results show that an alternative proposed by GPT-4o outperforms the expert-designed heuristic of CMSA.
arXiv Detail & Related papers (2025-02-12T10:58:57Z)
Can Large Language Models Be Trusted as Black-Box Evolutionary Optimizers for Combinatorial Problems? [8.082897040940447]
Large Language Models (LLMs) offer a game-changing solution with their extensive knowledge and could democratize the optimization paradigm. It is therefore imperative to evaluate the suitability of LLMs as an evolutionary mechanism (EVO).
arXiv Detail & Related papers (2025-01-25T05:19:19Z)
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [65.64108848398696]
We introduce a preference optimization process to enhance the multimodal reasoning capabilities of MLLMs. We develop a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10x larger InternVL2-76B.
arXiv Detail & Related papers (2024-11-15T18:59:27Z)
Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System [75.25394449773052]
Large Language Model (LLM) based multi-agent systems (MAS) show remarkable potential in collaborative problem-solving. Yet they still face critical challenges: low communication efficiency, poor scalability, and a lack of effective parameter-updating optimization methods. We present Optima, a novel framework that addresses these issues by significantly enhancing both communication efficiency and task effectiveness.
arXiv Detail & Related papers (2024-10-10T17:00:06Z)
SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models [8.558834738072363]
Large language models (LLMs) have seen widespread adoption due to their remarkable performance across various applications. These individual LLMs show limitations in generalization and performance on complex tasks due to inherent training biases, model size constraints, and the quality or diversity of pre-training datasets. We introduce SelectLLM, which efficiently directs input queries to the most suitable subset of LLMs from a large pool.
arXiv Detail & Related papers (2024-08-16T06:11:21Z)
FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications. FactorLLM achieves comparable performance to the source model, securing up to 85% model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
Solving General Natural-Language-Description Optimization Problems with Large Language Models [34.50671063271608]
We propose a novel framework called OptLLM that augments LLMs with external solvers. OptLLM accepts user queries in natural language, converts them into mathematical formulations and programming code, and calls the solvers to calculate the results. Some features of the OptLLM framework have been available for trial since June 2023.
arXiv Detail & Related papers (2024-07-09T07:11:10Z)
LLM as a Complementary Optimizer to Gradient Descent: A Case Study in Prompt Tuning [69.95292905263393]
We show that gradient-based optimization and high-level LLMs are complementary to each other and can effectively collaborate in a combined optimization framework.
arXiv Detail & Related papers (2024-05-30T06:24:14Z)
Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning [50.73666458313015]
Large Language Models (LLMs) have demonstrated significant potential in performing multiple tasks in multimedia applications. MoE has emerged as a promising solution with its sparse architecture for effective task decoupling. Intuition-MoR1E achieves superior efficiency and 2.15% overall accuracy improvement across 14 public datasets.
arXiv Detail & Related papers (2024-04-13T12:14:58Z)
Large Language Model-Based Evolutionary Optimizer: Reasoning with elitism [1.1463861912335864]
Large Language Models (LLMs) have demonstrated remarkable reasoning abilities. This paper asserts that LLMs possess the capability for zero-shot optimization across diverse scenarios. We introduce a novel population-based method for numerical optimization using LLMs.
arXiv Detail & Related papers (2024-03-04T13:57:37Z)
Polynomial Optimization: Enhancing RLT relaxations with Conic Constraints [0.0]
Conic optimization has emerged as a powerful tool for designing tractable and guaranteed algorithms for non-scale problems. We investigate the strengthening of the RLT relaxations of optimization problems through the addition of nine different types of constraints. We show how to design these variants and their performance with respect to each other and with respect to the standard RLT relaxations.
arXiv Detail & Related papers (2022-08-11T02:13:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.