Skipping Computations in Multimodal LLMs
- URL: http://arxiv.org/abs/2410.09454v1
- Date: Sat, 12 Oct 2024 09:21:45 GMT
- Title: Skipping Computations in Multimodal LLMs
- Authors: Mustafa Shukor, Matthieu Cord
- Abstract summary: This study investigates redundancy in Multimodal Large Language Models (MLLMs) during inference.
We propose different methods to skip computations, such as skipping entire blocks, FFN layers, or self-attention layers.
Our findings validate that a significant amount of computation can be avoided at inference time.
- Score: 63.29737699997859
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable success in both textual and multimodal domains. However, this success often comes with substantial computational costs, particularly when handling lengthy sequences of multimodal inputs. This has sparked many efforts focused on enhancing efficiency during training and inference. In this study, we investigate computation redundancy in Multimodal Large Language Models (MLLMs) during inference. We propose different methods to skip computations, such as skipping entire blocks, FFN or self-attention (SA) layers. Additionally, we explore parallelizing certain layers, such as FFN and SA layers. Our findings validate that (1) a significant amount of computation can be avoided at inference time, especially for tasks such as Visual Question Answering (VQA); (2) skipping computations during training can recover 97% of the original performance, even when skipping half of the blocks or removing 70% of the weights; and, alternatively, (3) properly training smaller LLMs can yield performance comparable to LLMs 2 or 3 times larger. To conclude, we extend our investigation to recent MLLMs, such as LLaVA-1.5, and observe similar results. Our work shows that there are redundant computations inside MLLMs, and thus significant potential for reducing inference costs without sacrificing performance. The code is available here: https://github.com/mshukor/ima-lmms.
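To make the skipping and parallelization strategies concrete, here is a minimal PyTorch sketch of a decoder block whose sub-layers can be skipped, or whose self-attention and FFN can be run in parallel, at inference time. This is an illustrative sketch, not the authors' implementation (their code is in the linked repository); the class name, dimensions, and the skip schedule are assumptions.

```python
# Minimal sketch (not the authors' code): a decoder block whose sub-layers can be
# skipped, or whose self-attention (SA) and FFN can be computed in parallel.
import torch
import torch.nn as nn

class SkippableBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.sa_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x, skip_block=False, skip_sa=False, skip_ffn=False, parallel=False):
        if skip_block:                       # skip the whole block
            return x
        if parallel:                         # SA and FFN computed from the same input
            h = self.sa_norm(x)
            sa_out, _ = self.attn(h, h, h, need_weights=False)
            return x + sa_out + self.ffn(self.ffn_norm(x))
        if not skip_sa:                      # standard sequential residual path
            h = self.sa_norm(x)
            sa_out, _ = self.attn(h, h, h, need_weights=False)
            x = x + sa_out
        if not skip_ffn:
            x = x + self.ffn(self.ffn_norm(x))
        return x

blocks = nn.ModuleList([SkippableBlock() for _ in range(8)])
x = torch.randn(1, 16, 512)
for i, blk in enumerate(blocks):
    x = blk(x, skip_block=(i % 2 == 1))      # e.g. skip every other block
```

Which blocks and sub-layers can actually be skipped without hurting accuracy on tasks such as VQA is what the paper evaluates empirically.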
Related papers
- $\gamma$-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models [87.43596173378913]
We propose an innovative strategy for existing MLLMs called $\gamma$-MoD.
In $\gamma$-MoD, a novel metric, ARank, is proposed to guide the deployment of MoDs in the MLLM.
Based on ARank, we propose two novel designs to maximize the computational sparsity of the MLLM.
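Mixture-of-Depths (MoD) layers let a lightweight router send only a fraction of tokens through a block while the rest bypass it. Below is a generic, hypothetical PyTorch sketch of such a router; it is not the $\gamma$-MoD design and omits the ARank-based layer selection entirely. The MoDLayer class, the capacity parameter, and the sigmoid gating are illustrative assumptions.

```python
# Generic mixture-of-depths sketch (not the gamma-MoD design): a router picks a
# fraction of tokens to go through the block; all other tokens bypass it.
import torch
import torch.nn as nn

class MoDLayer(nn.Module):
    def __init__(self, block, d_model, capacity=0.25):
        super().__init__()
        self.block = block                    # the wrapped FFN/attention block
        self.router = nn.Linear(d_model, 1)   # per-token routing score
        self.capacity = capacity              # fraction of tokens that are processed

    def forward(self, x):                     # x: (batch, seq, d_model)
        b, t, _ = x.shape
        k = max(1, int(self.capacity * t))
        scores = self.router(x).squeeze(-1)           # (batch, seq)
        routed_idx = scores.topk(k, dim=1).indices    # tokens that enter the block
        out = x.clone()
        for i in range(b):                            # explicit loop for clarity
            idx = routed_idx[i]
            processed = self.block(x[i, idx].unsqueeze(0)).squeeze(0)
            gate = torch.sigmoid(scores[i, idx]).unsqueeze(-1)
            out[i, idx] = x[i, idx] + gate * processed   # gated residual update
        return out

ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
y = MoDLayer(ffn, d_model=512, capacity=0.25)(torch.randn(2, 32, 512))
```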
arXiv Detail & Related papers (2024-10-17T17:59:53Z) - The Impact of Quantization on Retrieval-Augmented Generation: An Analysis of Small LLMs [2.6968321526169503]
Post-training quantization reduces the computational demand of Large Language Models (LLMs) but can weaken some of their capabilities.
This paper explores how quantization affects smaller LLMs' ability to perform retrieval-augmented generation (RAG).
Our findings reveal that if a 7B LLM performs the task well, quantization does not impair its performance and long-context reasoning capabilities.
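For context only, a typical post-training quantization setup for a RAG-style experiment might use Hugging Face transformers with 4-bit bitsandbytes weights, as sketched below. The model id, prompt format, and generation settings are placeholders, not the paper's configuration.

```python
# Placeholder setup (not the paper's configuration): a small LLM loaded with
# 4-bit post-training quantization and prompted with retrieved context.
# Requires the `transformers`, `accelerate`, and `bitsandbytes` packages and a GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"          # placeholder 7B model
quant_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_cfg, device_map="auto"
)

# RAG-style prompt: retrieved passages are prepended to the question.
context = "...retrieved passages go here..."
prompt = f"Context:\n{context}\n\nQuestion: Summarize the context in one sentence.\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```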
arXiv Detail & Related papers (2024-06-10T08:23:52Z) - Efficient Multimodal Large Language Models: A Survey [60.7614299984182]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding and reasoning.
The extensive model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry.
This survey provides a comprehensive and systematic review of the current state of efficient MLLMs.
arXiv Detail & Related papers (2024-05-17T12:37:10Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
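As a much simpler stand-in for the sparse pretraining recipe described above, the sketch below applies one-shot unstructured magnitude pruning to a single linear layer. The magnitude_prune_ helper and the 70% sparsity target are illustrative assumptions, not the paper's method.

```python
# Rough illustration only: one-shot unstructured magnitude pruning of one layer,
# far simpler than the paper's sparse pretraining recipe.
import torch
import torch.nn as nn

def magnitude_prune_(linear: nn.Linear, sparsity: float = 0.7) -> None:
    """Zero out the smallest-magnitude weights of `linear` in place."""
    w = linear.weight.data
    k = int(sparsity * w.numel())
    if k == 0:
        return
    threshold = w.abs().flatten().kthvalue(k).values   # k-th smallest magnitude
    w.mul_((w.abs() > threshold).to(w.dtype))          # keep only larger weights

layer = nn.Linear(4096, 4096)
magnitude_prune_(layer, sparsity=0.7)
print(f"sparsity: {(layer.weight == 0).float().mean().item():.2%}")
```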
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - Self-Selected Attention Span for Accelerating Large Language Model Inference [10.305434265471938]
Large language models (LLMs) can solve challenging tasks.
LLMs' inference computation is highly inefficient due to the increasing number of tokens they must attend to as they generate new ones.
We capitalize on LLMs' problem-solving capabilities to optimize their own inference-time efficiency.
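A limited attention span can be pictured as a causal mask that lets each token attend only to the most recent span positions. The sketch below uses a fixed window for illustration; in the paper the span is selected by the model itself, which this generic example does not implement, and the local_causal_mask helper is an assumption.

```python
# Fixed-window illustration only: each token attends to at most `span` recent
# positions. The paper lets the model pick the span itself; this sketch does not.
import torch
import torch.nn.functional as F

def local_causal_mask(seq_len, span):
    """True where attention is allowed: causal and within the last `span` positions."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - span)

q = k = v = torch.randn(1, 8, 128, 64)       # (batch, heads, seq, head_dim)
allowed = local_causal_mask(128, span=32)    # broadcasts over batch and heads
out = F.scaled_dot_product_attention(q, k, v, attn_mask=allowed)
```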
arXiv Detail & Related papers (2024-04-14T19:36:04Z) - FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization [6.931020818874328]
We introduce FlattenQuant, a method that significantly reduces the maximum value of a tensor by flattening its large channels, achieving low-bit per-tensor quantization with minimal accuracy loss.
Our work achieves up to a 2$\times$ speedup and 2.3$\times$ memory reduction for LLMs with negligible loss in accuracy.
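The core idea, reducing a tensor's per-tensor maximum by "flattening" outlier channels, can be sketched as splitting any activation channel whose magnitude exceeds a threshold into several scaled copies while duplicating the matching weight rows, then applying ordinary per-tensor INT8 quantization. This toy version (flatten_channels, quantize_per_tensor_int8, and the threshold t are made up for illustration) is not the FlattenQuant algorithm.

```python
# Toy illustration (not the FlattenQuant algorithm): split outlier activation
# channels into scaled copies, duplicate the matching weight rows so the matmul
# is unchanged, then apply plain per-tensor INT8 quantization.
import torch

def flatten_channels(x, w, t):
    """x: (tokens, in_features), w: (in_features, out_features), t: target max."""
    x_cols, w_rows = [], []
    for c in range(x.shape[1]):
        col, row = x[:, c], w[c]
        n = max(1, int(torch.ceil(col.abs().max() / t)))  # how many copies to split into
        for _ in range(n):
            x_cols.append(col / n)
            w_rows.append(row)
    return torch.stack(x_cols, dim=1), torch.stack(w_rows, dim=0)

def quantize_per_tensor_int8(x):
    scale = x.abs().max() / 127.0
    return torch.clamp((x / scale).round(), -128, 127).to(torch.int8), scale

x = torch.randn(4, 64)
x[:, 3] *= 50                                   # one outlier channel blows up the max
w = torch.randn(64, 32)
x_flat, w_flat = flatten_channels(x, w, t=6.0)
print(x.abs().max().item(), x_flat.abs().max().item())       # per-tensor max drops
print(torch.allclose(x_flat @ w_flat, x @ w, atol=1e-4))     # matmul result preserved
x_q, scale = quantize_per_tensor_int8(x_flat)                # now low-bit friendly
```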
arXiv Detail & Related papers (2024-02-28T02:00:34Z) - InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory [93.20588235940453]
In this paper, we introduce a training-free memory-based method, InfLLM.
InfLLM stores distant contexts into additional memory units and employs an efficient mechanism to lookup token-relevant units for attention.
Even when the sequence length is scaled to 1,024K, InfLLM still effectively captures long-distance dependencies.
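The memory lookup can be pictured as keeping a local window of recent keys/values, chunking older ones into fixed-size units, and retrieving only the few units whose representative key is most similar to the current query. The single-head sketch below, with its memory_attention helper and made-up sizes, is a schematic illustration rather than InfLLM's implementation.

```python
# Schematic single-head lookup (not InfLLM's implementation): attend to a local
# window plus the top-k older "memory units" most similar to the current query.
import torch
import torch.nn.functional as F

def memory_attention(q, k, v, local=64, unit=32, topk=2):
    d = k.shape[-1]
    k_local, v_local = k[-local:], v[-local:]
    k_old, v_old = k[:-local], v[:-local]
    if k_old.shape[0] >= unit:
        n_units = k_old.shape[0] // unit
        units_k = k_old[: n_units * unit].view(n_units, unit, d)
        units_v = v_old[: n_units * unit].view(n_units, unit, d)
        reps = units_k.mean(dim=1)                       # one representative key per unit
        top = (reps @ q).topk(min(topk, n_units)).indices
        k_local = torch.cat([units_k[top].reshape(-1, d), k_local])
        v_local = torch.cat([units_v[top].reshape(-1, d), v_local])
    attn = F.softmax(k_local @ q / d ** 0.5, dim=0)      # attention over selected keys
    return attn @ v_local

q = torch.randn(64)                                      # current query, head_dim = 64
k, v = torch.randn(1024, 64), torch.randn(1024, 64)      # long cached context
out = memory_attention(q, k, v)                          # shape: (64,)
```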
arXiv Detail & Related papers (2024-02-07T06:50:42Z) - LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
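Structural pruning removes whole units (for example FFN hidden channels) rather than individual weights. The toy sketch below ranks the hidden channels of a two-layer FFN by a simple weight-magnitude score and drops the least important ones; this scoring and the prune_ffn_channels helper are illustrative assumptions, far cruder than LLM-Pruner's dependency-aware importance estimation.

```python
# Toy structural pruning (not LLM-Pruner): drop the lowest-scoring hidden
# channels of a two-layer FFN, using a simple weight-magnitude importance score.
import torch
import torch.nn as nn

def prune_ffn_channels(up: nn.Linear, down: nn.Linear, keep_ratio=0.5):
    score = up.weight.norm(dim=1) * down.weight.norm(dim=0)       # per hidden channel
    keep = score.topk(int(keep_ratio * score.numel())).indices.sort().values
    new_up = nn.Linear(up.in_features, keep.numel(), bias=up.bias is not None)
    new_down = nn.Linear(keep.numel(), down.out_features, bias=down.bias is not None)
    new_up.weight.data = up.weight.data[keep]                     # keep selected rows
    new_down.weight.data = down.weight.data[:, keep]              # and matching columns
    if up.bias is not None:
        new_up.bias.data = up.bias.data[keep]
    if down.bias is not None:
        new_down.bias.data = down.bias.data.clone()
    return new_up, new_down

up, down = nn.Linear(512, 2048), nn.Linear(2048, 512)
up_p, down_p = prune_ffn_channels(up, down, keep_ratio=0.5)       # 2048 -> 1024 hidden units
pruned_ffn = nn.Sequential(up_p, nn.GELU(), down_p)               # smaller drop-in FFN
```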
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.