Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU
- URL: http://arxiv.org/abs/2409.09086v1
- Date: Wed, 11 Sep 2024 12:44:12 GMT
- Title: Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU
- Authors: Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo
- Abstract summary: Inf-MLLM is an efficient inference framework for Multimodal Large Language Models (MLLMs).
We show that Inf-MLLM enables multiple LLMs and MLLMs to achieve stable performance over 4M-token long texts and multi-round conversations with 1-hour-long videos on a single GPU.
- Score: 14.719538667881311
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) are distinguished by their multimodal comprehension ability and are widely used in many real-world applications, including GPT-4o, autonomous driving, and robotics. Despite their impressive performance, multimodal inputs always incur long contexts. Inference under long contexts requires caching massive Key and Value states (KV cache) of previous tokens, which introduces high latency and excessive memory consumption. For this reason, it is challenging to deploy streaming inference of MLLMs on edge devices, which largely constrains the power and applicability of MLLMs in real-world applications. In this paper, we introduce Inf-MLLM, an efficient inference framework for MLLMs, which enables streaming inference of MLLMs on a single GPU with infinite context. Inf-MLLM is based on our key observation of the attention pattern in both LLMs and MLLMs, which we call "attention saddles". Thanks to this newly discovered attention pattern, Inf-MLLM maintains a size-constrained KV cache by dynamically caching recent tokens and relevant tokens. Furthermore, Inf-MLLM proposes attention bias, a novel approach that enables MLLMs to capture long-term dependencies. We show that Inf-MLLM enables multiple LLMs and MLLMs to achieve stable performance on 4M-token-long texts and multi-round conversations with 1-hour-long videos on a single GPU. In addition, Inf-MLLM exhibits superior streaming reasoning quality compared to existing methods such as StreamingLLM, and a 2x speedup over H2O.
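The abstract describes the core mechanism only at a high level: a size-constrained KV cache that keeps recent tokens plus "relevant" tokens identified via the attention-saddle pattern. The sketch below illustrates that general idea with a simple accumulated-attention score; the class name, scoring rule, and eviction policy are assumptions made for illustration, not Inf-MLLM's actual implementation, and the paper's attention-bias mechanism is not modeled here.

```python
# Minimal sketch (assumptions, not the paper's code): a bounded KV cache that
# always keeps the newest `recent` tokens and, among older tokens, keeps the
# ones that have accumulated the most attention ("relevant" tokens).
import torch


class BoundedKVCache:
    def __init__(self, budget: int, recent: int):
        assert budget > recent
        self.budget = budget      # maximum number of cached tokens
        self.recent = recent      # number of most recent tokens always kept
        self.keys = None          # (num_tokens, head_dim)
        self.values = None        # (num_tokens, head_dim)
        self.scores = None        # accumulated attention mass per cached token

    def append(self, k: torch.Tensor, v: torch.Tensor, attn: torch.Tensor):
        """k, v: (1, head_dim); attn: attention weights the newest query
        assigned to the currently cached tokens, shape (num_tokens,)."""
        if self.keys is None:
            self.keys, self.values = k, v
            self.scores = torch.zeros(1)
            return
        # Accumulate how much attention each cached token has received so far.
        self.scores = self.scores + attn[: self.scores.numel()]
        self.keys = torch.cat([self.keys, k])
        self.values = torch.cat([self.values, v])
        self.scores = torch.cat([self.scores, torch.zeros(1)])
        self._evict()

    def _evict(self):
        n = self.keys.size(0)
        if n <= self.budget:
            return
        old = n - self.recent
        # Keep the most "relevant" older tokens plus all recent tokens.
        keep_old = torch.topk(self.scores[:old], self.budget - self.recent).indices
        keep = torch.cat([keep_old.sort().values, torch.arange(old, n)])
        self.keys, self.values = self.keys[keep], self.values[keep]
        self.scores = self.scores[keep]
```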
Related papers
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling [56.130911402831906]
This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling.
We develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs' ability to perceive fine-grained details in videos.
Experimental results demonstrate that this unique design of LRC greatly improves the results of video MLLMs in mainstream understanding benchmarks.
arXiv Detail & Related papers (2025-01-21T18:59:00Z)
- COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework [11.512418684814026]
We propose COEF-VQ, a novel cascaded MLLM framework for better video quality understanding on TikTok.
To demonstrate the effectiveness of COEF-VQ, we deployed this new framework onto the video management platform (VMP) at TikTok.
We show that COEF-VQ leads to substantial performance gains with limited resource consumption on these two tasks.
arXiv Detail & Related papers (2024-12-11T08:10:32Z)
- AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning [19.68349294206012]
We propose a training-free adaptive inference method for multi-modal LLMs.
With a minimalist design, our method can be applied to both video and image LLMs.
Under a similar computational cost, our method outperforms the state-of-the-art methods in long video understanding.
arXiv Detail & Related papers (2024-12-04T11:47:57Z)
- Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [69.35226485836641]
The excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and incurs prohibitively expensive computation.
We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE).
DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
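Based only on the summary above, a minimal sketch of the DyVTE idea might look as follows: a lightweight gating network reads the pooled text-token states and, once its confidence passes a threshold, drops every visual token so that later layers process text only. The module name, pooling, and threshold are illustrative assumptions rather than the paper's design.

```python
# Hedged sketch of a dynamic visual-token exit gate (assumptions, not DyVTE's code).
import torch
import torch.nn as nn


class VisualTokenExit(nn.Module):
    def __init__(self, hidden_dim: int, threshold: float = 0.5):
        super().__init__()
        # Lightweight "hyper-network": a small MLP over pooled text states.
        self.gate = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, 1),
        )
        self.threshold = threshold

    def forward(self, hidden: torch.Tensor, is_visual: torch.Tensor):
        """hidden: (seq, dim); is_visual: (seq,) bool mask of visual tokens."""
        text_states = hidden[~is_visual]                     # (num_text, dim)
        p_exit = torch.sigmoid(self.gate(text_states.mean(dim=0)))
        if p_exit.item() > self.threshold:
            # Remove every visual token; later layers see text tokens only.
            return hidden[~is_visual], is_visual[~is_visual]
        return hidden, is_visual
```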
arXiv Detail & Related papers (2024-11-29T11:24:23Z)
- LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from a large MLLM (l-MLLM) to a small MLLM (s-MLLM).
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM.
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
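As a rough illustration of the MDist objective described above, the snippet below aligns the student's token-level output distribution with the teacher's via temperature-scaled KL divergence; the temperature, reduction, and function name are assumptions, and the paper's exact formulation may differ.

```python
# Hedged sketch of a distillation loss that minimizes the divergence between
# teacher (l-MLLM) and student (s-MLLM) output distributions.
import torch
import torch.nn.functional as F


def mdist_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               temperature: float = 2.0) -> torch.Tensor:
    """Both logits: (batch, seq_len, vocab). Returns a scalar KL loss."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean KL, scaled by T^2 as is common in logit distillation.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```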
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
- Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z)
- Efficient Multimodal Large Language Models: A Survey [60.7614299984182]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding and reasoning.
The extensive model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry.
This survey provides a comprehensive and systematic review of the current state of efficient MLLMs.
arXiv Detail & Related papers (2024-05-17T12:37:10Z)
- Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference [59.91176945361035]
We introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference.
VTW strategically withdraws vision tokens at a certain layer, enabling only text tokens to engage in subsequent layers.
Our approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
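A hedged sketch of the withdrawal step described above: after a chosen layer index, vision tokens are removed and only text tokens flow through the remaining layers. The function signature and mask handling are assumptions for illustration, not the paper's code.

```python
# Hedged sketch: drop vision tokens once a chosen layer is reached, so the
# remaining layers process text tokens only. `layers` is any stack of modules
# mapping (seq, dim) -> (seq, dim).
import torch
import torch.nn as nn


def forward_with_vtw(layers: nn.ModuleList,
                     hidden: torch.Tensor,
                     is_visual: torch.Tensor,
                     withdraw_at: int) -> torch.Tensor:
    """hidden: (seq, dim); is_visual: (seq,) bool mask; withdraw_at: layer index."""
    for i, layer in enumerate(layers):
        if i == withdraw_at:
            # Withdraw vision tokens; subsequent layers see text tokens only.
            hidden = hidden[~is_visual]
        hidden = layer(hidden)
    return hidden
```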
arXiv Detail & Related papers (2024-05-09T14:38:53Z)
- InfMLLM: A Unified Framework for Visual-Language Tasks [44.29407348046122]
Multimodal large language models (MLLMs) have attracted growing interest.
This work delves into enabling LLMs to tackle more vision-language-related tasks.
InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
arXiv Detail & Related papers (2023-11-12T09:58:16Z)