Efficient Multi-modal Long Context Learning for Training-free Adaptation
- URL: http://arxiv.org/abs/2505.19812v1
- Date: Mon, 26 May 2025 10:49:44 GMT
- Title: Efficient Multi-modal Long Context Learning for Training-free Adaptation
- Authors: Zehong Ma, Shiliang Zhang, Longhui Wei, Qi Tian
- Abstract summary: This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC). It embeds demonstration examples directly into the model input. It condenses long-context multimodal inputs into compact, task-specific memory representations.
- Score: 96.21248144937627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional approaches to adapting multi-modal large language models (MLLMs) to new tasks have relied heavily on fine-tuning. This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC), a novel training-free alternative that embeds demonstration examples directly into the model input. EMLoC offers a more efficient, flexible, and scalable solution for task adaptation. Because extremely lengthy inputs introduce prohibitive computational and memory overhead, EMLoC contributes a chunk-wise compression mechanism combined with layer-wise adaptive pruning. It condenses long-context multimodal inputs into compact, task-specific memory representations. By adaptively pruning tokens at each layer under a Jensen-Shannon divergence constraint, our method achieves a dramatic reduction in inference complexity without sacrificing performance. This approach is the first to seamlessly integrate compression and pruning techniques for multi-modal long-context learning, offering a scalable and efficient solution for real-world applications. Extensive experiments on diverse vision-language benchmarks demonstrate that EMLoC achieves performance on par with or superior to naive long-context approaches. Our results highlight the potential of EMLoC as a groundbreaking framework for efficient and flexible adaptation of multi-modal models in resource-constrained environments. Codes are publicly available at https://github.com/Zehong-Ma/EMLoC.
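The layer-wise pruning step described in the abstract lends itself to a short illustration. Below is a minimal sketch, assuming a per-token importance score (e.g. accumulated attention mass) and a fixed set of candidate keep-ratios; the function names (`prune_context_adaptively`, `jsd`), the toy single-head `attend`, and all dimensions are illustrative assumptions, not the released EMLoC implementation (see the linked repository for the actual code).

```python
# Sketch: at one layer, prune as many cached context tokens as possible while the
# Jensen-Shannon divergence (JSD) between pruned and unpruned output distributions
# stays under a budget. All names and shapes here are illustrative assumptions.
import torch
import torch.nn.functional as F


def jsd(p, q, eps=1e-8):
    """Jensen-Shannon divergence between two probability vectors."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * ((a + eps).log() - (b + eps).log())).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def attend(query, ctx):
    """Toy single-head attention of one query vector over the context tokens."""
    scores = (query @ ctx.T) / ctx.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ ctx


def prune_context_adaptively(ctx, query, importance, lm_head,
                             budget=0.02, keep_ratios=(0.25, 0.5, 0.75, 1.0)):
    """Keep the smallest context subset whose output distribution stays within the JSD budget."""
    ref = F.softmax(lm_head(attend(query, ctx)), dim=-1)    # unpruned reference distribution
    order = importance.argsort(descending=True)             # most important tokens first
    for ratio in keep_ratios:                                # try the most aggressive cut first
        k = max(1, int(ratio * ctx.size(0)))
        kept = ctx[order[:k].sort().values]                  # keep top-k tokens, original order preserved
        out = F.softmax(lm_head(attend(query, kept)), dim=-1)
        if jsd(ref, out) <= budget:                          # constraint satisfied at this layer
            return kept
    return ctx                                               # fall back to no pruning


# Toy usage with random tensors standing in for one layer's cached demonstration context.
torch.manual_seed(0)
ctx = torch.randn(256, 64)            # compressed long-context memory at this layer
query = torch.randn(64)               # current token's hidden state
importance = torch.rand(256)          # e.g. accumulated attention mass per context token
lm_head = torch.nn.Linear(64, 1000)   # proxy head mapping hidden states to a distribution
print(prune_context_adaptively(ctx, query, importance, lm_head).shape)
```

In this toy setting the JSD budget plays the role of the divergence constraint mentioned in the abstract: more aggressive keep-ratios are accepted only when they leave the output distribution essentially unchanged.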
Related papers
- PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning [54.73049408950049]
We propose a Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning. Our approach improves unified multimodal retrieval from both structural and learning perspectives.
arXiv Detail & Related papers (2025-07-10T16:47:25Z) - Learning to Inference Adaptively for Multimodal Large Language Models [19.510735093226703]
We introduce AdaLLaVA, an adaptive inference framework that learns to reconfigure operations in an MLLM during inference. Our results show that AdaLLaVA effectively adheres to the input latency budget, achieving varying accuracy and latency trade-offs at runtime.
arXiv Detail & Related papers (2025-03-13T21:39:38Z) - Robust Multimodal Learning via Cross-Modal Proxy Tokens [11.704477276235847]
Multimodal models often experience a significant performance drop when one or more modalities are missing during inference. We propose a simple yet effective approach that enhances robustness to missing modalities while maintaining strong performance when all modalities are available. Our method introduces cross-modal proxy tokens (CMPTs), which approximate the class token of a missing modality by attending only to the tokens of the available modality.
arXiv Detail & Related papers (2025-01-29T18:15:49Z) - LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z) - Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models [16.16372459671255]
Large Language Models (LLMs) typically generate outputs token by token using a fixed compute budget.
We propose a novel framework that integrates smaller auxiliary modules within each Feed-Forward Network layer of the LLM.
We show that trained routers operate differently from oracles and often yield suboptimal solutions.
arXiv Detail & Related papers (2024-10-01T16:10:21Z) - KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [52.02764371205856]
Long context capability is a crucial competency for large language models (LLMs).
This work provides a taxonomy of current methods and evaluates 10+ state-of-the-art approaches across seven categories of long-context tasks.
arXiv Detail & Related papers (2024-07-01T17:59:47Z) - CoLLiE: Collaborative Training of Large Language Models in an Efficient Way [59.09824823710863]
CoLLiE is an efficient library that facilitates collaborative training of large language models.
With its modular design and comprehensive functionality, CoLLiE offers a balanced blend of efficiency, ease of use, and customization.
arXiv Detail & Related papers (2023-12-01T08:02:16Z) - MultiWay-Adapter: Adapting large-scale multi-modal models for scalable image-text retrieval [4.4173427917548524]
MultiWay-Adapter (MWA) is a novel framework featuring an 'Alignment Enhancer'
This enhancer deepens inter-modal alignment, enabling high transferability with minimal tuning effort.
Experiments show that, unlike prior efficient tuning approaches, MWA maintains model effectiveness while reducing training time by up to 57%.
arXiv Detail & Related papers (2023-09-04T10:48:29Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
Existing approaches for adapting pretrained models to vision-language tasks still rely on several key components that hinder their efficiency.
We instead direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
We show that by freezing more than 99% of the total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning (a minimal sketch of this recipe follows the list below).
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
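As referenced in the eP-ALM entry above, the following is a minimal sketch of the freeze-almost-everything recipe that entry summarizes: only a single linear projection and one prepended soft token are trainable, while the language model stays frozen. The class name `PerceptualAdapter`, the toy stand-in LM, and all dimensions are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of an eP-ALM-style adaptation: freeze the LM, train one projection + one soft token.
import torch
import torch.nn as nn


class PerceptualAdapter(nn.Module):
    """Wrap a frozen LM: train only a vision-to-LM projection and one prepended soft token."""

    def __init__(self, lm: nn.Module, vis_dim: int, lm_dim: int):
        super().__init__()
        self.lm = lm
        for p in self.lm.parameters():            # freeze the language model entirely
            p.requires_grad = False
        self.proj = nn.Linear(vis_dim, lm_dim)    # the single trained projection layer
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))  # one trainable prepended token

    def forward(self, vis_feats, text_embeds):
        # vis_feats: (B, N_vis, vis_dim); text_embeds: (B, N_txt, lm_dim)
        vis = self.proj(vis_feats)                              # project perception into LM space
        prefix = self.soft_token.expand(vis.size(0), -1, -1)    # broadcast the soft token over the batch
        return self.lm(torch.cat([prefix, vis, text_embeds], dim=1))


# Toy usage: a tiny stand-in "LM" so the sketch runs end to end.
toy_lm = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 32000))
model = PerceptualAdapter(toy_lm, vis_dim=768, lm_dim=512)
out = model(torch.randn(2, 10, 768), torch.randn(2, 6, 512))
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(out.shape, f"trainable fraction: {trainable / total:.2%}")
```

With a real pretrained LM the trainable fraction printed at the end would be well under 1%, which is the point of the recipe; the toy LM here is only large enough to make the sketch runnable.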