Fast-Slow Efficient Training for Multimodal Large Language Models via Visual Token Pruning
- URL: http://arxiv.org/abs/2602.03815v1
- Date: Tue, 03 Feb 2026 18:18:11 GMT
- Title: Fast-Slow Efficient Training for Multimodal Large Language Models via Visual Token Pruning
- Authors: Dingkun Zhang, Shuhan Qi, Yulin Wu, Xinyu Xiao, Xuan Wang, Long Chen
- Abstract summary: Existing efforts in efficient training focus on reducing model sizes or trainable parameters. Applying VTP at the training stage results in a training-inference mismatch. We propose DualSpeed, a fast-slow framework for efficient training of MLLMs.
- Score: 19.291990706129994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) suffer from a severe training inefficiency issue, which stems from their massive model sizes and visual token counts. Existing efforts in efficient training focus on reducing model sizes or trainable parameters. Inspired by the success of Visual Token Pruning (VTP) in improving inference efficiency, we explore another substantial research direction for efficient training: reducing visual tokens. However, applying VTP at the training stage results in a training-inference mismatch: pruning-trained models perform poorly when inferring on non-pruned full visual token sequences. To close this gap, we propose DualSpeed, a fast-slow framework for efficient training of MLLMs. The fast-mode is the primary mode, which incorporates existing VTP methods as plugins to reduce visual tokens, along with a mode isolator to isolate the model's behaviors. The slow-mode is the auxiliary mode, in which the model is trained on full visual sequences to retain training-inference consistency. To boost its training, the slow-mode further leverages self-distillation to learn from the sufficiently trained fast-mode. Together, DualSpeed achieves both training efficiency and non-degraded performance. Experiments show DualSpeed accelerates the training of LLaVA-1.5 by 2.1$\times$ and LLaVA-NeXT by 4.0$\times$, retaining over 99% performance. Code: https://github.com/dingkun-zhang/DualSpeed
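The abstract describes the fast-slow interplay only in prose; below is a minimal, hypothetical PyTorch sketch of how such a loop could be wired up, assuming a learned mode embedding as the mode isolator, a norm-based stand-in for the pluggable VTP method, and a periodic slow-mode step with fast-to-slow self-distillation. All names and hyperparameters (FastSlowMLLM, prune_visual_tokens, slow_every, distill_weight) are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Hypothetical sketch of a DualSpeed-style fast-slow training loop.
# Names and the norm-based pruning stand-in are illustrative, not the released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FastSlowMLLM(nn.Module):
    """Toy MLLM wrapper: a learned mode embedding added to visual tokens acts as a
    mode isolator, keeping fast-mode (pruned) and slow-mode (full) behaviors apart."""

    def __init__(self, d_model=64, vocab=1000):
        super().__init__()
        self.mode_embedding = nn.Embedding(2, d_model)  # 0 = fast, 1 = slow
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, visual_tokens, text_tokens, mode):
        mode_id = torch.full((visual_tokens.size(0),), 0 if mode == "fast" else 1,
                             dtype=torch.long, device=visual_tokens.device)
        visual_tokens = visual_tokens + self.mode_embedding(mode_id).unsqueeze(1)
        h = self.backbone(torch.cat([visual_tokens, text_tokens], dim=1))
        return self.lm_head(h[:, -text_tokens.size(1):])  # logits over text positions


def prune_visual_tokens(visual_tokens, keep_ratio=0.25):
    """Stand-in for any plug-in VTP method: keep tokens with the largest L2 norm
    (a real plug-in would use attention scores or a learned criterion)."""
    k = max(1, int(visual_tokens.size(1) * keep_ratio))
    idx = visual_tokens.norm(dim=-1).topk(k, dim=1).indices
    return torch.gather(visual_tokens, 1,
                        idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1)))


def training_step(model, batch, step, slow_every=8, distill_weight=1.0):
    visual, text, labels = batch  # labels: next-token targets for the text span
    if step % slow_every != 0:
        # Fast-mode (primary): train on pruned visual tokens.
        logits = model(prune_visual_tokens(visual), text, mode="fast")
        return F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    # Slow-mode (auxiliary): full visual sequence keeps training and inference
    # consistent; self-distillation learns from frozen fast-mode predictions.
    with torch.no_grad():
        teacher = model(prune_visual_tokens(visual), text, mode="fast").softmax(-1)
    logits = model(visual, text, mode="slow")
    ce = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    kd = F.kl_div(logits.log_softmax(-1), teacher, reduction="batchmean")
    return ce + distill_weight * kd


# Usage sketch: step 0 triggers a slow-mode step (step % slow_every == 0).
model = FastSlowMLLM()
batch = (torch.randn(2, 16, 64), torch.randn(2, 8, 64), torch.randint(0, 1000, (2, 8)))
training_step(model, batch, step=0).backward()
```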
Related papers
- DiRL: An Efficient Post-Training Framework for Diffusion Language Models [54.405206032785706]
Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models. Existing methods suffer from computational inefficiency and objective mismatches between training and inference. We introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference.
arXiv Detail & Related papers (2025-12-23T08:33:19Z) - FREE: Fast and Robust Vision Language Models with Early Exits [5.402030962296633]
We introduce FREE, an adversarial training approach within a GAN-based framework. Our method focuses on performing input-adaptive inference that increases inference speed with minimal drop in performance. We experimentally validate that our method speeds up the inference process by more than 1.51x while retaining comparable performance.
arXiv Detail & Related papers (2025-06-07T18:26:58Z) - DeepInsert: Early Layer Bypass for Efficient and Performant Multimodal Understanding [26.39397960987363]
We propose a simple modification to pretrained transformer models. Instead of concatenation with the language prompt at the start, we insert multimodal tokens directly into the middle. Our results indicate that our method reduces computational costs during both training and inference.
arXiv Detail & Related papers (2025-04-27T18:56:26Z) - Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping [13.846838416902575]
A key bottleneck stems from the proliferation of visual tokens required for fine-grained image understanding. We propose Skip-Vision, a unified framework addressing both training and inference inefficiencies in vision-language models. Experimental results demonstrate that Skip-Vision reduces training time by up to 35%, inference FLOPs by 75%, and latency by 45%.
arXiv Detail & Related papers (2025-03-26T04:16:48Z) - Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z) - An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency; a minimal sketch of this style of attention-guided pruning appears after this list.
arXiv Detail & Related papers (2024-03-11T14:35:32Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Efficient NLP Model Finetuning via Multistage Data Filtering [11.058786955754004]
We set out to filter training examples in a streaming fashion, in tandem with training the target model.
Our key techniques are (1) automatically determining a training loss threshold for skipping backward training passes, and (2) running a meta predictor for further skipping forward training passes.
Our method reduces the required training examples by up to 5.3$\times$ and training time by up to 6.8$\times$, while incurring only minor accuracy degradation.
arXiv Detail & Related papers (2022-07-28T21:43:31Z) - Automated Progressive Learning for Efficient Training of Vision Transformers [125.22744987949227]
Vision Transformers (ViTs) have come with a voracious appetite for computing power, highlighting the urgent need to develop efficient training methods for ViTs.
Progressive learning, a training scheme where the model capacity grows progressively during training, has started showing its ability in efficient training.
In this paper, we take a practical step towards efficient training of ViTs by customizing and automating progressive learning.
arXiv Detail & Related papers (2022-03-28T05:37:08Z) - A Practical Incremental Method to Train Deep CTR Models [37.54660958085938]
We introduce a practical incremental method to train deep CTR models, which consists of three decoupled modules.
Our method can achieve comparable performance to the conventional batch mode training with much better training efficiency.
arXiv Detail & Related papers (2020-09-04T12:35:42Z) - LightPAFF: A Two-Stage Distillation Framework for Pre-training and Fine-tuning [146.51221523793342]
LightPAFF uses two-stage knowledge distillation to transfer knowledge from a big teacher model to a lightweight student model.
LightPAFF reduces the model size by nearly 5x and improves online inference speed by 5x-7x.
arXiv Detail & Related papers (2020-04-27T14:00:09Z)
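As a companion to the FastV entry above, here is a minimal, hypothetical sketch of attention-guided visual token pruning in that general style: after an early layer, visual tokens are ranked by the attention they receive from text queries and only the top fraction is kept. The function name, layer choice, and ranking rule are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of FastV-style attention-guided visual token pruning.
# The ranking rule and keep ratio are illustrative, not the paper's exact method.
import torch


def prune_by_attention(attn, num_visual, keep_ratio=0.5):
    """attn: (batch, heads, seq, seq) attention weights from one early layer.
    Visual tokens are assumed to occupy positions [0, num_visual); the rest
    are text tokens. Returns indices of visual tokens to keep."""
    # Average attention each visual token receives from all text queries and heads.
    received = attn[:, :, num_visual:, :num_visual].mean(dim=(1, 2))  # (batch, num_visual)
    k = max(1, int(num_visual * keep_ratio))
    return received.topk(k, dim=-1).indices  # (batch, k)


# Usage sketch: drop half of 576 visual tokens based on one layer's attention map.
attn = torch.softmax(torch.randn(1, 8, 600, 600), dim=-1)
keep = prune_by_attention(attn, num_visual=576, keep_ratio=0.5)
print(keep.shape)  # torch.Size([1, 288])
```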