ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism
- URL: http://arxiv.org/abs/2507.10069v2
- Date: Fri, 15 Aug 2025 04:27:30 GMT
- Title: ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism
- Authors: Zedong Liu, Shenggan Cheng, Guangming Tan, Yang You, Dingwen Tao
- Abstract summary: Multimodal large language models (MLLMs) handle images, videos, and audio by incorporating feature extractors and projection modules. Current tightly coupled serving architectures struggle to distinguish between mixed request types. We introduce Elastic Multimodal Parallelism (EMP), a new serving paradigm that elastically adapts to resource heterogeneity.
- Score: 9.93378263858092
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) extend LLMs to handle images, videos, and audio by incorporating feature extractors and projection modules. However, these additional components -- combined with complex inference pipelines and heterogeneous workloads -- introduce significant inference overhead. Therefore, efficiently serving MLLMs remains a major challenge. Current tightly coupled serving architectures struggle to distinguish between mixed request types or adapt parallelism strategies to different inference stages, leading to increased time-to-first-token (TTFT) latency and poor resource utilization. To address this, we introduce Elastic Multimodal Parallelism (EMP), a new serving paradigm that elastically adapts to resource heterogeneity across request types and inference stages. Building upon EMP, we develop ElasticMM, an MLLM serving system that (1) separates requests into independent modality groups with dynamic resource allocation via a modality-aware load balancer; (2) decouples inference stages and enables parallelism adjustment and adaptive scaling via elastic partition scheduling; and (3) improves inference efficiency through unified multimodal prefix caching and non-blocking encoding. Experiments on diverse real-world datasets show that ElasticMM outperforms state-of-the-art (SOTA) serving systems, reducing TTFT by up to 4.2x and achieving 3.2-4.5x higher throughput while meeting service-level objectives (SLOs).
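As a rough illustration of point (1) in the abstract, the sketch below groups requests by modality and splits a fixed pool of GPUs between the groups in proportion to their pending load. It is not ElasticMM's implementation: the `Request` type, its `cost` field, and the proportional allocation rule are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record; the field names are assumptions.
    req_id: int
    modality: str  # e.g. "text" or "image"
    cost: float    # estimated work, in arbitrary units


def group_by_modality(requests):
    """Partition requests into independent per-modality groups."""
    groups = defaultdict(list)
    for req in requests:
        groups[req.modality].append(req)
    return dict(groups)


def allocate_gpus(groups, total_gpus):
    """Split total_gpus across groups in proportion to pending cost.

    Each group first receives one GPU. The remaining GPUs are shared
    proportionally, with leftovers going to the largest fractional
    remainders (ties broken by modality name).
    """
    if total_gpus < len(groups):
        raise ValueError("need at least one GPU per modality group")
    load = {m: sum(r.cost for r in reqs) for m, reqs in groups.items()}
    total_load = sum(load.values())
    spare = total_gpus - len(groups)
    if total_load > 0:
        ideal = {m: spare * load[m] / total_load for m in groups}
    else:
        ideal = {m: spare / len(groups) for m in groups}
    alloc = {m: 1 + int(ideal[m]) for m in groups}
    leftover = total_gpus - sum(alloc.values())
    by_remainder = sorted(groups, key=lambda m: (-(ideal[m] - int(ideal[m])), m))
    for m in by_remainder[:leftover]:
        alloc[m] += 1
    return alloc


requests = [
    Request(0, "text", 1.0),
    Request(1, "text", 1.0),
    Request(2, "image", 6.0),
]
groups = group_by_modality(requests)
print(allocate_gpus(groups, total_gpus=6))
```

Re-running `allocate_gpus` whenever the request mix shifts would give a crude form of dynamic allocation. A real system would also have to account for the cost of moving model state between groups, which this sketch ignores.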
Related papers
- MDL: A Unified Multi-Distribution Learner in Large-scale Industrial Recommendation through Tokenization [14.534152704620261]
Industrial recommender systems increasingly adopt multi-scenario learning (MSL) and multi-task learning (MTL) to handle diverse user interactions and contexts.
Existing approaches suffer from two critical drawbacks: (1) underutilization of large-scale model parameters due to limited interaction with complex feature modules, and (2) difficulty in jointly modeling scenario and task information in a unified framework.
We propose a unified Multi-Distribution Learner (MDL), an MSL framework inspired by the "prompting" paradigm in large language models (LLMs).
arXiv Detail & Related papers (2026-02-07T12:34:27Z)
- From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion [91.35078719566472]
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection.
We introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities.
arXiv Detail & Related papers (2026-01-15T18:59:10Z)
- FusionFactory: Fusing LLM Capabilities with Multi-LLM Log Data [60.09659670497899]
Large language models (LLMs) have a diverse landscape of models, each excelling at different tasks.
This diversity drives researchers to employ multiple LLMs in practice, leaving behind valuable multi-LLM log data.
We argue that practical fusion must meet two essential requirements: (1) compatibility with real-world serving scenarios (e.g., local and API-based serving), and (2) flexibility to operate at different stages of the LLM pipeline to meet varied user needs.
arXiv Detail & Related papers (2025-07-14T17:58:02Z)
- PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning [54.73049408950049]
We propose a Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning.
Our approach improves unified multimodal retrieval from both structural and learning perspectives.
arXiv Detail & Related papers (2025-07-10T16:47:25Z)
- Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs).
It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs.
It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
- Distilling Transitional Pattern to Large Language Models for Multimodal Session-based Recommendation [67.84581846180458]
Session-based recommendation (SBR) predicts the next item based on anonymous sessions.
Recent Multimodal SBR methods utilize simplistic pre-trained models for modality learning but have limitations in semantic richness.
We propose a multimodal LLM-enhanced framework TPAD, which extends a distillation paradigm to decouple and align transitional patterns for promoting MSBR.
arXiv Detail & Related papers (2025-04-13T07:49:08Z)
- ModServe: Scalable and Resource-Efficient Large Multimodal Model Serving [19.388562622309838]
Large multimodal models (LMMs) demonstrate impressive capabilities in understanding images, videos, and audio beyond text.
We present the first comprehensive systems analysis of two prominent LMM architectures, decoder-only and cross-attention, across six representative open-source models.
We propose ModServe, a modular LMM serving system that decouples stages for independent optimization and adaptive scaling.
arXiv Detail & Related papers (2025-02-02T22:10:40Z)
- AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment [13.977849745488339]
AmoebaLLM is a novel framework designed to enable the instant derivation of large language models of arbitrary shapes.
AmoebaLLM significantly facilitates rapid deployment tailored to various platforms and applications.
arXiv Detail & Related papers (2024-11-15T22:02:28Z)
- LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
- Flextron: Many-in-One Flexible Large Language Model [85.93260172698398]
We introduce Flextron, a network architecture and post-training model optimization framework supporting flexible model deployment.
We present a sample-efficient training method and associated routing algorithms for transforming an existing trained LLM into a Flextron model.
We demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% tokens compared to original pretraining.
arXiv Detail & Related papers (2024-06-11T01:16:10Z)
- LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism [12.521026493432181]
Existing large language models (LLMs) cannot efficiently serve variable-length requests in different phases.
We propose a new parallelism paradigm, elastic sequence parallelism (ESP), to adapt to the variance between different requests and phases.
LoongServe improves the maximum throughput by up to 3.85x compared to the chunked prefill and 5.81x compared to the prefill-decoding disaggregation.
arXiv Detail & Related papers (2024-04-15T07:45:04Z)
- CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z)
- Modality Plug-and-Play: Elastic Modality Adaptation in Multimodal LLMs for Embodied AI [10.82017289243097]
Large Language Models (LLMs) are capable of reasoning over diverse input data modalities through pre-trained encoders.
m-LLM improves the task accuracy by up to 4% compared to the best existing scheme.
arXiv Detail & Related papers (2023-12-13T04:08:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.