Modality Plug-and-Play: Elastic Modality Adaptation in Multimodal LLMs
for Embodied AI
- URL: http://arxiv.org/abs/2312.07886v1
- Date: Wed, 13 Dec 2023 04:08:59 GMT
- Title: Modality Plug-and-Play: Elastic Modality Adaptation in Multimodal LLMs
for Embodied AI
- Authors: Kai Huang, Boyuan Yang and Wei Gao
- Abstract summary: Large Language Models (LLMs) are capable of reasoning over diverse input data modalities through pre-trained encoders.
mPnP-LLM improves the task accuracy by up to 4% compared to the best existing scheme.
- Score: 10.82017289243097
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) are capable of reasoning over diverse input data
modalities through pre-trained encoders. However, the growing diversity of
input data modalities prevents incorporating all modalities into LLMs,
especially when LLMs are deployed on resource-constrained edge devices for
embodied AI applications. Instead, a better option is to adaptively involve
only the useful modalities at runtime, depending on the current environmental
contexts and task requirements. For such modality adaptation, existing work
adopts fixed connections between encoders and the LLM's input layer, leading to
high training cost at runtime and ineffective cross-modal interaction. In this
paper, we address these limitations by presenting mPnP-LLM, a new technique
that allows fully elastic, automated and prompt runtime modality adaptation, by
connecting unimodal encoders to a flexible set of last LLM blocks and making
such latent connections fully trainable at runtime. Experiments over the
nuScenes-QA dataset show that mPnP-LLM can achieve up to 3.7x FLOPs reduction
and 30% GPU memory usage reduction, while retaining on-par accuracy with the
existing schemes. Under the same compute budget, mPnP-LLM improves the task
accuracy by up to 4% compared to the best existing scheme.
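To make the mechanism concrete, the sketch below illustrates the core idea described in the abstract: unimodal encoders are attached to only the last few blocks of a frozen LLM through small trainable latent connections, so runtime modality adaptation only (re)trains those connections. The module names, gating, pooling, and dimensions are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LatentConnection(nn.Module):
    """Trainable link from one unimodal encoder into one LLM block (illustrative)."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)   # map encoder tokens into the LLM's space
        self.gate = nn.Parameter(torch.zeros(1))  # learned injection strength, starts at 0

    def forward(self, hidden, enc_tokens):
        # hidden: (B, T, llm_dim) block input; enc_tokens: (B, S, enc_dim)
        injected = self.proj(enc_tokens).mean(dim=1, keepdim=True)  # pooled to (B, 1, llm_dim)
        return hidden + torch.tanh(self.gate) * injected

class ElasticAdapter(nn.Module):
    """Attach latent connections only to the last `k` blocks of a frozen LLM."""
    def __init__(self, llm_blocks, enc_dims, llm_dim, k=2):
        super().__init__()
        self.blocks, self.k = llm_blocks, k       # frozen transformer blocks
        self.conns = nn.ModuleDict({
            name: nn.ModuleList([LatentConnection(d, llm_dim) for _ in range(k)])
            for name, d in enc_dims.items()
        })

    def forward(self, hidden, enc_outputs):
        n = len(self.blocks)
        for i, block in enumerate(self.blocks):
            if i >= n - self.k:                   # only the last k blocks receive injections
                for name, enc_tokens in enc_outputs.items():
                    hidden = self.conns[name][i - (n - self.k)](hidden, enc_tokens)
            hidden = block(hidden)
        return hidden

# Runtime adaptation: everything is frozen except the latent connections, so adding or
# removing a modality only (re)trains a few small projections and gates.
blocks = nn.ModuleList([nn.TransformerEncoderLayer(512, 8, batch_first=True) for _ in range(4)])
for p in blocks.parameters():
    p.requires_grad_(False)
adapter = ElasticAdapter(blocks, enc_dims={"camera": 768, "lidar": 256}, llm_dim=512, k=2)
hidden = torch.randn(1, 16, 512)
enc = {"camera": torch.randn(1, 10, 768), "lidar": torch.randn(1, 32, 256)}
print(adapter(hidden, enc).shape)  # torch.Size([1, 16, 512])
```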
Related papers
- AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment [13.977849745488339]
AmoebaLLM is a novel framework designed to enable the instant derivation of large language models of arbitrary shapes.
AmoebaLLM significantly facilitates rapid deployment tailored to various platforms and applications.
arXiv Detail & Related papers (2024-11-15T22:02:28Z)
- LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
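A rough illustration of the general idea of per-modality adapter paths that can be added and switched on a frozen backbone is sketched below; the class and method names and the low-rank adapter shape are assumptions for illustration, not PathWeave's actual implementation.

```python
import torch
import torch.nn as nn

class ModalPathRouter(nn.Module):
    """Frozen backbone plus per-modality adapter 'paths' that can be expanded with new
    modalities and switched at runtime (names and adapter shape are illustrative)."""
    def __init__(self, dim: int):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)   # stand-in for a frozen MLLM backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.paths = nn.ModuleDict()          # one lightweight adapter per modality
        self.active = None

    def expand(self, modality: str, dim: int, rank: int = 8):
        # Add a new low-rank adapter path; only these few parameters would be trained.
        self.paths[modality] = nn.Sequential(
            nn.Linear(dim, rank), nn.GELU(), nn.Linear(rank, dim))

    def switch(self, modality: str):
        self.active = modality                # change the active modal path without retraining

    def forward(self, x):
        h = self.backbone(x)
        if self.active is not None:
            h = h + self.paths[self.active](h)
        return h

model = ModalPathRouter(dim=64)
model.expand("audio", dim=64)
model.switch("audio")
print(model(torch.randn(2, 64)).shape)        # torch.Size([2, 64])
```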
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
- Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding.
PMPD achieves a 1.4x-12.2x speedup in matrix-vector multiplications over fp16 models.
Our approach delivers a throughput gain of 3.8x-8.0x over fp16 models and up to 1.54x over uniform quantization approaches.
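A minimal sketch of the idea of progressively lowering weight precision as decoding proceeds is shown below; the phase thresholds, bit-widths, and fake-quantization scheme are assumed for illustration rather than taken from the paper.

```python
import torch

def precision_schedule(step: int, phases=((32, 8), (128, 4), (10**9, 2))):
    """Pick a weight bit-width for the current decoding step: early tokens use higher
    precision, later tokens progressively lower (thresholds here are made up)."""
    for max_step, bits in phases:
        if step < max_step:
            return bits
    return phases[-1][1]

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric uniform fake-quantization, used only to emulate low-precision weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

# Emulated decoding loop: the same weight matrix is applied at progressively lower precision.
W = torch.randn(16, 16)
x = torch.randn(16)
for step in (0, 50, 200):
    bits = precision_schedule(step)
    y = fake_quantize(W, bits) @ x        # the memory-bound matrix-vector product PMPD targets
    print(f"step={step} bits={bits} |y|={y.norm().item():.3f}")
```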
arXiv Detail & Related papers (2024-10-17T11:46:33Z)
- SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration [10.970637831760136]
Speculative decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs).
We introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference.
We show that SWIFT can achieve over a 1.3x-1.6x speedup while preserving the original distribution of the generated text.
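The sketch below illustrates self-speculative decoding on a toy model: draft tokens with some intermediate layers skipped, then verify the whole draft with one full-model pass under greedy decoding. The toy model and the fixed skip set are assumptions; SWIFT's adaptive layer-selection strategy is not reproduced here.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy causal decoder used only to illustrate drafting via layer skipping."""
    def __init__(self, vocab=100, dim=64, n_layers=6):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, 4, batch_first=True) for _ in range(n_layers)])
        self.head = nn.Linear(dim, vocab)

    def logits(self, ids, skip=(), last_only=True):
        h = self.emb(ids)
        mask = torch.full((ids.shape[1], ids.shape[1]), float("-inf")).triu(1)  # causal mask
        for i, layer in enumerate(self.layers):
            if i in skip:                      # the drafting path skips selected layers
                continue
            h = layer(h, src_mask=mask)
        out = self.head(h)
        return out[:, -1] if last_only else out

@torch.no_grad()
def self_speculative_step(model, ids, skip_layers, draft_len=4):
    """Draft greedily with the skipped-layer model, then verify with one full-model pass."""
    draft = ids
    for _ in range(draft_len):
        nxt = model.logits(draft, skip=skip_layers).argmax(-1, keepdim=True)
        draft = torch.cat([draft, nxt], dim=1)
    preds = model.logits(draft, last_only=False).argmax(-1)   # full model, every position
    accepted = ids
    for t in range(ids.shape[1], draft.shape[1]):
        token = preds[:, t - 1:t]              # full model's greedy choice for position t
        accepted = torch.cat([accepted, token], dim=1)
        if token.item() != draft[0, t].item():
            break                              # first disagreement: keep the full model's token
    return accepted

model = TinyLM().eval()
ids = torch.randint(0, 100, (1, 5))
print(self_speculative_step(model, ids, skip_layers={2, 3}).shape)
```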
arXiv Detail & Related papers (2024-10-09T14:15:30Z)
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently the two mainstream methods for adapting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
- ELMS: Elasticized Large Language Models On Mobile Devices [5.689405542579458]
On-device Large Language Models (LLMs) are revolutionizing mobile AI, enabling applications such as UI automation while addressing privacy concerns.
We introduce ELMS, an on-device LLM service designed to provide elasticity in both the model and prompt dimensions.
A one-time neuron reordering technique, which exploits the inherent permutation consistency within transformer models to create high-quality, elastic sub-models.
A dual-head compact language model, which efficiently refines prompts and coordinates the elastic adaptation between the model and the prompt.
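The neuron-reordering idea can be sketched as follows: rank an FFN's hidden neurons by an importance proxy (here, weight norms, an assumed metric), permute them once so the layer's output is unchanged, and then serve elastic sub-models by keeping only a prefix of neurons.

```python
import torch
import torch.nn as nn

def reorder_ffn_neurons(ffn_in: nn.Linear, ffn_out: nn.Linear):
    """One-time reordering of an FFN's hidden neurons by importance (weight norms are an
    assumed proxy) so that any prefix of neurons forms a usable elastic sub-model."""
    importance = ffn_in.weight.norm(dim=1) * ffn_out.weight.norm(dim=0)  # one score per neuron
    order = importance.argsort(descending=True)
    with torch.no_grad():
        ffn_in.weight.copy_(ffn_in.weight[order])        # permute rows of the input projection
        ffn_in.bias.copy_(ffn_in.bias[order])
        ffn_out.weight.copy_(ffn_out.weight[:, order])   # permute the matching output columns
    return order

def elastic_ffn_forward(x, ffn_in, ffn_out, keep: int):
    """Run only the first `keep` (most important) hidden neurons of the reordered FFN."""
    h = torch.relu(x @ ffn_in.weight[:keep].t() + ffn_in.bias[:keep])
    return h @ ffn_out.weight[:, :keep].t() + ffn_out.bias

ffn_in, ffn_out = nn.Linear(64, 256), nn.Linear(256, 64)
reorder_ffn_neurons(ffn_in, ffn_out)
x = torch.randn(2, 64)
print(elastic_ffn_forward(x, ffn_in, ffn_out, keep=128).shape)   # torch.Size([2, 64])
```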
arXiv Detail & Related papers (2024-09-08T06:32:08Z)
- MoExtend: Tuning New Experts for Modality and Task Extension [61.29100693866109]
MoExtend is an effective framework designed to streamline the modality adaptation and extension of Mixture-of-Experts (MoE) models.
MoExtend seamlessly integrates new experts into pre-trained MoE models, endowing them with novel knowledge without the need to tune pretrained models.
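A minimal sketch of extending a frozen MoE layer with a new expert and a widened router, so that only the added parameters would be trained, is shown below; the layer structure and routing are simplified assumptions, not MoExtend's exact design.

```python
import torch
import torch.nn as nn

class ExtensibleMoE(nn.Module):
    """Minimal MoE layer whose pretrained experts stay frozen while newly added
    experts (and the widened router) are the parts intended to be trained."""
    def __init__(self, dim: int, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.router = nn.Linear(dim, n_experts)
        for p in self.parameters():           # treat these as pretrained: freeze them
            p.requires_grad_(False)

    def add_expert(self, dim: int):
        # Append a fresh expert and widen the router by one output row, keeping the
        # pretrained routing rows (for brevity the whole new router is left trainable).
        self.experts.append(nn.Linear(dim, dim))
        old = self.router
        self.router = nn.Linear(dim, len(self.experts))
        with torch.no_grad():
            self.router.weight[:-1].copy_(old.weight)
            self.router.bias[:-1].copy_(old.bias)

    def forward(self, x, top_k: int = 2):
        gates = self.router(x).softmax(-1)
        vals, idx = gates.topk(top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(top_k):                # dispatch each token to its top-k experts
            for e, expert in enumerate(self.experts):
                mask = (idx[:, k] == e).float().unsqueeze(-1)
                out = out + mask * vals[:, k:k + 1] * expert(x)
        return out

moe = ExtensibleMoE(dim=32)
moe.add_expert(32)
print(moe(torch.randn(4, 32)).shape)          # torch.Size([4, 32])
```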
arXiv Detail & Related papers (2024-08-07T02:28:37Z)
- MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development.
This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z)
- CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
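The sketch below shows one plausible shape of a lightweight, modular fusion module: each modality gets a small projector that can be added sequentially, and shared learnable queries cross-attend over all projected modality tokens. The structure is assumed for illustration and is not CREMA's exact architecture.

```python
import torch
import torch.nn as nn

class LightweightFusion(nn.Module):
    """Illustrative modular fusion: a tiny projector per modality plus a shared
    cross-attention block that fuses all modality tokens into fixed query tokens."""
    def __init__(self, dim=64, n_queries=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.projs = nn.ModuleDict()                 # one small projector per modality

    def add_modality(self, name: str, feat_dim: int, dim=64):
        # Modality-sequential extension: add a projector without touching existing ones.
        self.projs[name] = nn.Linear(feat_dim, dim)

    def forward(self, feats: dict):
        tokens = torch.cat([self.projs[m](f) for m, f in feats.items()], dim=1)
        q = self.queries.expand(tokens.shape[0], -1, -1)
        fused, _ = self.attn(q, tokens, tokens)      # queries attend over all modality tokens
        return fused                                 # (B, n_queries, dim), ready for the LLM

fusion = LightweightFusion()
fusion.add_modality("video", feat_dim=512)
fusion.add_modality("audio", feat_dim=128)
feats = {"video": torch.randn(2, 20, 512), "audio": torch.randn(2, 30, 128)}
print(fusion(feats).shape)                           # torch.Size([2, 8, 64])
```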
arXiv Detail & Related papers (2024-02-08T18:27:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.