EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model
- URL: http://arxiv.org/abs/2408.11795v3
- Date: Sun, 06 Apr 2025 18:52:08 GMT
- Title: EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model
- Authors: Feipeng Ma, Yizhou Zhou, Zheyu Zhang, Shilin Yan, Hebei Li, Zilong He, Siying Wu, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun
- Abstract summary: Current approaches for vision and language interaction fall into two categories: self-attention-based and cross-attention-based methods. We modify the original self-attention mechanism in MLLM to a composite attention mechanism. EE-MLLM significantly outperforms Flamingo with limited training data, and reduces the prefilling time to 79 ms on an H800 GPU. We also present a training-free variant named EE-MLLM-F, which reduces the computation cost of self-attention-based methods without additional training.
- Score: 15.449472477182061
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated satisfactory performance across various vision-language tasks. Current approaches for vision and language interaction fall into two categories: self-attention-based and cross-attention-based methods. However, both approaches present inherent limitations, forcing a trade-off between data and computational efficiency. To address this issue, we introduce the Data-$\textbf{E}$fficient and Compute-$\textbf{E}$fficient $\textbf{MLLM}$ ($\textbf{EE-MLLM}$). Specifically, we modify the original self-attention mechanism in MLLM to a composite attention mechanism. This mechanism has two key characteristics: 1) eliminating the computational overhead of self-attention among visual tokens to achieve $\textbf{compute efficiency}$, and 2) reusing the weights from each layer of LLM to facilitate effective vision-language modality alignment for $\textbf{data efficiency}$. As a result, EE-MLLM significantly outperforms Flamingo with limited training data, and reduces the prefilling time to 79 ms on an H800 GPU, compared to LLaVA's 277 ms. To further investigate the efficiency of EE-MLLM, we present a training-free variant named EE-MLLM-F, which reduces the computation cost of self-attention-based methods without additional training. Experimental results demonstrate the effectiveness of EE-MLLM across a range of benchmarks, including general-purpose datasets like MMBench and SeedBench, as well as fine-grained tasks such as TextVQA and DocVQA.
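The abstract ships no code, but the compute-efficiency claim can be pictured with a minimal PyTorch sketch of a composite-attention-style layer: only text tokens act as queries (so the quadratic vision-to-vision attention term disappears), while visual tokens remain visible as keys and values through the same projection weights. The tensor names, shapes, and the pass-through treatment of visual tokens below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def composite_attention(hidden, n_vis, wq, wk, wv, wo):
    """Sketch: text tokens attend to all visual tokens plus earlier text
    tokens; visual tokens skip attention entirely, so there is no
    quadratic vision-vision term."""
    vis, txt = hidden[:, :n_vis], hidden[:, n_vis:]
    q = txt @ wq                          # queries come from text tokens only
    k, v = hidden @ wk, hidden @ wv       # keys/values cover vision + text
    n_txt = txt.shape[1]
    causal = torch.ones(n_txt, n_txt).tril().bool()           # text-to-text causal part
    mask = torch.cat([torch.ones(n_txt, n_vis).bool(), causal], dim=1)
    scores = (q @ k.transpose(-2, -1)) / k.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    txt_out = F.softmax(scores, dim=-1) @ v @ wo
    return torch.cat([vis, txt_out], dim=1)   # visual tokens pass through unchanged

# toy usage
d, n_vis, n_txt = 64, 16, 8
h = torch.randn(1, n_vis + n_txt, d)
wq, wk, wv, wo = (torch.randn(d, d) / d ** 0.5 for _ in range(4))
print(composite_attention(h, n_vis, wq, wk, wv, wo).shape)  # torch.Size([1, 24, 64])
```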
Related papers
- Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training [12.911726316306755]
We introduce OrchMLLM, a framework designed to mitigate the inefficiencies in MLLM training caused by Modality Composition Incoherence.
Batch Post-Balancing Dispatcher and MLLM Global Orchestrator are used to eliminate mini-batch imbalances in sequential data.
OrchMLLM achieves a Model FLOPs Utilization (MFU) of $41.6\%$ when training an 84B MLLM with three modalities on $2560$ H100 GPUs, outperforming Megatron-LM by up to $3.1\times$ in throughput.
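The balancing idea can be pictured with a small greedy re-partitioning routine: hand the largest samples out first, always to whichever device currently holds the fewest tokens. This is only a stand-in for the paper's Batch Post-Balancing Dispatcher, with made-up token counts.

```python
import numpy as np

def post_balance(token_counts, n_devices):
    """Greedy re-partition: give each sample to the currently least-loaded
    device so per-device token totals stay roughly equal."""
    order = np.argsort(token_counts)[::-1]        # largest samples first
    loads = np.zeros(n_devices)
    buckets = [[] for _ in range(n_devices)]
    for idx in order:
        d = int(np.argmin(loads))
        buckets[d].append(int(idx))
        loads[d] += token_counts[idx]
    return buckets, loads

counts = np.random.randint(50, 2000, size=64)     # tokens per sample (toy)
buckets, loads = post_balance(counts, n_devices=8)
print(loads)                                      # near-uniform per-device loads
```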
arXiv Detail & Related papers (2025-03-31T08:24:23Z) - AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning [19.68349294206012]
We propose a training-free adaptive inference method for multi-modal LLMs.
With a minimalist design, our method can be applied to both video and image LLMs.
Under a similar computational cost, our method outperforms the state-of-the-art methods in long video understanding.
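As a rough illustration of training-free token reduction, the sketch below averages the most similar adjacent visual-token pairs; the paper's actual merging and pruning criteria may differ, and the token counts are assumptions.

```python
import torch
import torch.nn.functional as F

def merge_tokens(x, r):
    """Illustrative merge: average the r most similar adjacent token pairs
    and keep everything else (not the paper's exact matching rule)."""
    n = x.shape[0] - x.shape[0] % 2
    a, b = x[:n:2], x[1:n:2]                        # adjacent pairs
    sim = F.cosine_similarity(a, b, dim=-1)         # one similarity per pair
    merge_idx = sim.topk(r).indices                 # the most redundant pairs
    keep_mask = torch.ones(a.shape[0], dtype=torch.bool)
    keep_mask[merge_idx] = False
    merged = (a[merge_idx] + b[merge_idx]) / 2      # one token per merged pair
    kept = torch.cat([a[keep_mask], b[keep_mask], x[n:]])
    return torch.cat([kept, merged])                # x.shape[0] - r tokens remain

vision_tokens = torch.randn(576, 1024)              # e.g. a 24x24 ViT token grid
print(merge_tokens(vision_tokens, r=128).shape)     # torch.Size([448, 1024])
```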
arXiv Detail & Related papers (2024-12-04T11:47:57Z) - Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction [62.8375542401319]
Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone.
The number of vision tokens increases quadratically with the image resolution, leading to huge computational costs.
We propose a greedy search algorithm (G-Search) to find the least number of vision tokens to keep at each layer from the shallow to the deep.
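A greedy layer-wise search of this kind can be sketched in a few lines; the `evaluate` callback, the halving schedule, and the tolerance below are hypothetical stand-ins for the paper's actual procedure.

```python
def g_search(n_layers, full_tokens, evaluate, tol=0.01):
    """Greedy sketch: from shallow to deep layers, keep halving the number
    of retained vision tokens while the benchmark score stays within `tol`
    of the full-token baseline. `evaluate` is a hypothetical callback."""
    baseline = evaluate([full_tokens] * n_layers)
    keep = [full_tokens] * n_layers
    for layer in range(n_layers):
        while keep[layer] > 1:
            trial = keep.copy()
            trial[layer] = keep[layer] // 2
            if evaluate(trial) < baseline - tol:
                break                    # halving this layer further hurts accuracy
            keep = trial
    return keep

# toy evaluator: deeper layers tolerate fewer tokens (purely illustrative)
required = [288, 144, 36, 0]
toy_eval = lambda keep: 1.0 if all(k >= r for k, r in zip(keep, required)) else 0.9
print(g_search(n_layers=4, full_tokens=576, evaluate=toy_eval))  # [288, 144, 36, 1]
```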
arXiv Detail & Related papers (2024-11-30T18:54:32Z) - Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [69.35226485836641]
Excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and incurs prohibitively expensive computation.
We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE).
DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
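The gating idea can be sketched as a tiny MLP over pooled text-token states that outputs a per-sample exit probability; the layer sizes, pooling choice, and threshold below are assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

class VisualTokenExit(nn.Module):
    """Sketch of a lightweight gate: read pooled text-token states and
    predict whether all visual tokens can be dropped after this layer."""
    def __init__(self, d_model, hidden=128):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(d_model, hidden), nn.GELU(),
                                  nn.Linear(hidden, 1))

    def forward(self, text_states):              # (batch, n_text, d_model)
        pooled = text_states.mean(dim=1)         # summarise the text token status
        return torch.sigmoid(self.gate(pooled))  # probability of exiting visual tokens

gate = VisualTokenExit(d_model=4096)
p_exit = gate(torch.randn(2, 32, 4096))
drop_visual = p_exit > 0.5                       # per-sample exit decision (assumed rule)
print(p_exit.shape, drop_visual.flatten())
```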
arXiv Detail & Related papers (2024-11-29T11:24:23Z) - Large language models enabled multiagent ensemble method for efficient EHR data labeling [9.481473827205159]
This study introduces a novel multiagent ensemble method powered by LLMs to address a key challenge in ML: data labeling.
By using an ensemble of LLMs and natural language processing, we labeled the MIMIC-IV ECG dataset of 623,566 ECG reports with an estimated accuracy of 98.2%.
We applied the ensemble-LLM method to identify SDOH from the social history sections of 1,405 EHR clinical notes, also achieving competitive performance.
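A minimal version of the ensemble idea is a majority vote over several independent LLM labelers; the agent callables below are toy stand-ins for real model calls.

```python
from collections import Counter

def ensemble_label(report, agents):
    """Majority vote across several LLM 'agents' (each a callable that
    returns a label string for the given report)."""
    votes = [agent(report) for agent in agents]
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)             # label plus agreement ratio

# toy agents standing in for real LLM calls
agents = [lambda r: "atrial fibrillation",
          lambda r: "atrial fibrillation",
          lambda r: "sinus rhythm"]
print(ensemble_label("ECG report text ...", agents))
```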
arXiv Detail & Related papers (2024-10-21T22:12:00Z) - LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from a large MLLM (l-MLLM) to a small MLLM (s-MLLM).
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM.
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
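A generic form of such a distillation objective is a temperature-scaled KL divergence between the teacher and student output distributions; the exact loss in the paper (and the temperature) may differ.

```python
import torch
import torch.nn.functional as F

def mdist_loss(student_logits, teacher_logits, T=2.0):
    """KL-divergence distillation between teacher (l-MLLM) and student
    (s-MLLM) output distributions; a generic formulation."""
    p_t = F.log_softmax(teacher_logits / T, dim=-1)
    p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(p_s, p_t, log_target=True, reduction="batchmean") * T * T

teacher = torch.randn(4, 32000)       # vocabulary-sized logits (toy)
student = torch.randn(4, 32000)
print(mdist_loss(student, teacher).item())
```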
arXiv Detail & Related papers (2024-10-21T17:41:28Z) - $\gamma$-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models [87.43596173378913]
We propose an innovative strategy for existing MLLMs called $\gamma$-MoD.
In $\gamma$-MoD, a novel metric, ARank, is proposed to guide the deployment of MoDs in the MLLM.
Based on ARank, we propose two novel designs to maximize the computational sparsity of the MLLM.
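One way to picture a rank-based routing metric is to estimate the effective rank of a layer's attention map from its singular values, as sketched below; the threshold and the precise definition used in the paper are assumptions.

```python
import torch

def attention_rank(attn, thresh=1e-3):
    """Sketch of an ARank-style metric: effective rank of a layer's
    attention map, estimated from its singular values."""
    s = torch.linalg.svdvals(attn)               # singular values of the map
    return int((s / s.max() > thresh).sum())     # count the significant ones

attn = torch.softmax(torch.randn(256, 256), dim=-1)
print(attention_rank(attn))   # a low rank suggests the layer can be sparsified
```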
arXiv Detail & Related papers (2024-10-17T17:59:53Z) - EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
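In the spirit of a lightweight fusion module, the sketch below projects pooled visual features into the text space and mixes them in through a learned gate; the dimensions and the gating design are illustrative assumptions, not EMMA's architecture.

```python
import torch
import torch.nn as nn

class LightweightFusion(nn.Module):
    """Minimal sketch of a lightweight cross-modality fusion module."""
    def __init__(self, d_vis, d_txt):
        super().__init__()
        self.proj = nn.Linear(d_vis, d_txt)
        self.gate = nn.Linear(2 * d_txt, d_txt)

    def forward(self, vis, txt):                  # (B, Nv, d_vis), (B, Nt, d_txt)
        v = self.proj(vis).mean(dim=1, keepdim=True)           # pooled visual cue
        g = torch.sigmoid(self.gate(torch.cat([txt, v.expand_as(txt)], -1)))
        return txt + g * v                        # visually conditioned text tokens

fuse = LightweightFusion(d_vis=1024, d_txt=4096)
out = fuse(torch.randn(2, 576, 1024), torch.randn(2, 32, 4096))
print(out.shape)  # torch.Size([2, 32, 4096])
```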
arXiv Detail & Related papers (2024-10-02T23:00:31Z) - UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models [0.42832989850721054]
Multimodal Entity Linking (MEL) is a crucial task that aims at linking ambiguous mentions within multimodal contexts to referent entities in a multimodal knowledge base, such as Wikipedia.
Existing methods overcomplicate the MEL task and overlook the visual semantic information, which makes them costly and hard to scale.
We propose UniMEL, a unified framework which establishes a new paradigm to process multimodal entity linking tasks using Large Language Models.
arXiv Detail & Related papers (2024-07-23T03:58:08Z) - Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning [15.919493497867567]
This study aims to evaluate the performance of Multimodal Large Language Models (MLLMs) on the VALSE benchmark.
We conducted a comprehensive assessment of state-of-the-art MLLMs, varying in model size and pretraining datasets.
arXiv Detail & Related papers (2024-07-17T11:26:47Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
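The self-synthesis loop can be sketched as: prompt the student with a growing pool of examples, filter the generations, and keep the survivors as finetuning data. `generate` and `quality_check` below are hypothetical stand-ins for the real model call and filter.

```python
import random

def self_guide(seed_examples, generate, quality_check, n_rounds=3):
    """Sketch of the self-synthesis loop: prompt the student with the current
    example pool, filter the generations, and keep survivors as finetuning data."""
    synthetic, pool = [], list(seed_examples)
    for _ in range(n_rounds):
        candidates = [generate(pool) for _ in range(8)]
        kept = [c for c in candidates if quality_check(c)]
        synthetic.extend(kept)
        pool.extend(kept)                      # bootstrap the next round
    return synthetic                           # finetune the student on these pairs

# toy stand-ins for the model call and the quality filter
gen = lambda pool: (f"question {random.randint(0, 99)}", "answer")
ok = lambda pair: len(pair[0]) > 0
print(len(self_guide([("seed question", "seed answer")], gen, ok)))  # 24
```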
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic [6.46176287368784]
We propose $\textbf{M}$odel $\textbf{E}$xclusive $\textbf{T}$ask $\textbf{A}$rithmetic for merging $\textbf{GPT}$-scale models.
Our proposed MetaGPT is data-agnostic and bypasses the heavy search process, making it cost-effective and easy to implement for LLMs.
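Task arithmetic itself is easy to sketch: add scaled task vectors (finetuned minus base weights) onto the base model. MetaGPT's contribution is deriving the scaling in closed form without data; the fixed `scale` below is a placeholder assumption.

```python
import torch

def task_arithmetic_merge(base, finetuned_models, scale=0.3):
    """Generic task-arithmetic merge: add scaled task vectors
    (finetuned - base) to the base weights."""
    merged = {}
    for name, w in base.items():
        task_vectors = [ft[name] - w for ft in finetuned_models]
        merged[name] = w + scale * sum(task_vectors)
    return merged

base = {"linear.weight": torch.randn(8, 8)}
ft_a = {"linear.weight": base["linear.weight"] + 0.1 * torch.randn(8, 8)}
ft_b = {"linear.weight": base["linear.weight"] + 0.1 * torch.randn(8, 8)}
print(task_arithmetic_merge(base, [ft_a, ft_b])["linear.weight"].shape)
```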
arXiv Detail & Related papers (2024-06-17T10:12:45Z) - NoteLLM-2: Multimodal Large Representation Models for Recommendation [60.17448025069594]
We investigate the potential of Large Language Models to enhance multimodal representation in multimodal item-to-item recommendations.
One feasible method is the transfer of Multimodal Large Language Models (MLLMs) for representation tasks.
We propose a novel training framework, NoteLLM-2, specifically designed for multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z) - Small LLMs Are Weak Tool Learners: A Multi-LLM Agent [73.54562551341454]
Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs.
We propose a novel approach that decomposes the aforementioned capabilities into a planner, caller, and summarizer.
This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability.
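The decomposition can be sketched as three independent callables, each of which could be backed by its own (possibly small) LLM; the toy planner, caller, summarizer, and tool registry below are hypothetical stand-ins.

```python
def multi_llm_agent(task, planner, caller, summarizer, tools):
    """Sketch of the planner/caller/summarizer split."""
    plan = planner(task)                              # decide which tools to invoke
    results = [caller(step, tools) for step in plan]  # execute each tool call
    return summarizer(task, results)                  # condense results into an answer

# toy stand-ins
tools = {"search": lambda q: f"results for {q!r}"}
planner = lambda task: [("search", task)]
caller = lambda step, tools: tools[step[0]](step[1])
summarizer = lambda task, results: f"{task}: " + "; ".join(results)
print(multi_llm_agent("weather in Paris", planner, caller, summarizer, tools))
```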
arXiv Detail & Related papers (2024-01-14T16:17:07Z) - ModaVerse: Efficiently Transforming Modalities with LLMs [25.49713745405194]
We introduce ModaVerse, a Multi-modal Large Language Model capable of comprehending and transforming content across various modalities.
We propose a novel Input/Output (I/O) alignment mechanism that operates directly at the level of natural language.
arXiv Detail & Related papers (2024-01-12T06:28:54Z) - InfMLLM: A Unified Framework for Visual-Language Tasks [44.29407348046122]
Multimodal large language models (MLLMs) have attracted growing interest.
This work delves into enabling LLMs to tackle more vision-language-related tasks.
InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
arXiv Detail & Related papers (2023-11-12T09:58:16Z) - Generative Multimodal Entity Linking [24.322540112710918]
Multimodal Entity Linking (MEL) is the task of mapping mentions with multimodal contexts to referent entities from a knowledge base.
Existing MEL methods mainly focus on designing complex multimodal interaction mechanisms and require fine-tuning all model parameters.
We propose GEMEL, a Generative Multimodal Entity Linking framework based on Large Language Models (LLMs).
Our framework is compatible with any off-the-shelf language model, paving the way towards an efficient and general solution.
arXiv Detail & Related papers (2023-06-22T07:57:19Z) - LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present the Language-Assisted Multi-Modal (LAMM) instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z) - LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
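Structural pruning of coupled layers can be sketched as removing the same hidden channels from both the producing and the consuming projection; the norm-based importance score below is a simplification of LLM-Pruner's criterion.

```python
import torch
import torch.nn as nn

def prune_coupled_channels(linear_in, linear_out, keep_ratio=0.75):
    """Sketch: drop the hidden channels with the smallest weight norms from
    a coupled pair of linear layers, keeping their shapes consistent."""
    importance = linear_in.weight.norm(dim=1)            # one score per hidden unit
    k = int(keep_ratio * importance.numel())
    keep = importance.topk(k).indices.sort().values
    new_in = nn.Linear(linear_in.in_features, k)
    new_out = nn.Linear(k, linear_out.out_features)
    new_in.weight.data = linear_in.weight.data[keep]
    new_in.bias.data = linear_in.bias.data[keep]
    new_out.weight.data = linear_out.weight.data[:, keep]
    new_out.bias.data = linear_out.bias.data.clone()
    return new_in, new_out

up, down = nn.Linear(4096, 11008), nn.Linear(11008, 4096)   # LLaMA-style MLP pair
p_up, p_down = prune_coupled_channels(up, down)
print(p_up.weight.shape, p_down.weight.shape)  # (8256, 4096), (4096, 8256)
```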
arXiv Detail & Related papers (2023-05-19T12:10:53Z)