Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training
- URL: http://arxiv.org/abs/2503.23830v2
- Date: Wed, 09 Apr 2025 06:39:29 GMT
- Title: Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training
- Authors: Yijie Zheng, Bangjun Xiao, Lei Shi, Xiaoyang Li, Faming Wu, Tianyu Li, Xuefeng Xiao, Yang Zhang, Yuxuan Wang, Shouda Liu
- Abstract summary: We introduce OrchMLLM, a framework designed to mitigate the inefficiencies in MLLM training caused by Modality Composition Incoherence. Batch Post-Balancing Dispatcher and MLLM Global Orchestrator are used to eliminate mini-batch imbalances in sequential data. OrchMLLM achieves a Model FLOPs Utilization (MFU) of $41.6\%$ when training an 84B MLLM with three modalities on $2560$ H100 GPUs, outperforming Megatron-LM by up to $3.1\times$ in throughput.
- Score: 12.911726316306755
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multimodal large language models (MLLMs), such as GPT-4o, are garnering significant attention. During the exploration of MLLM training, we identified Modality Composition Incoherence, a phenomenon in which the proportion of a certain modality varies dramatically across examples. It exacerbates the challenges of addressing mini-batch imbalances, which lead to uneven GPU utilization between Data Parallel (DP) instances and severely degrade the efficiency and scalability of MLLM training, ultimately slowing training and hindering further research on MLLMs. To address these challenges, we introduce OrchMLLM, a comprehensive framework designed to mitigate the inefficiencies in MLLM training caused by Modality Composition Incoherence. First, we propose Batch Post-Balancing Dispatcher, a technique that efficiently eliminates mini-batch imbalances in sequential data. Additionally, we integrate MLLM Global Orchestrator into the training framework to orchestrate multimodal data and tackle the issues arising from Modality Composition Incoherence. We evaluate OrchMLLM across various MLLM sizes, demonstrating its efficiency and scalability. Experimental results reveal that OrchMLLM achieves a Model FLOPs Utilization (MFU) of $41.6\%$ when training an 84B MLLM with three modalities on $2560$ H100 GPUs, outperforming Megatron-LM by up to $3.1\times$ in throughput.
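The dispatcher itself is not detailed in the abstract, but the core idea of batch post-balancing — redistributing variable-length samples across DP ranks after the global batch is formed, so no rank idles on a short shard — can be sketched with a greedy longest-processing-time heuristic. Everything below (the `post_balance` name, the `seq_len` field) is illustrative, not the authors' API:

```python
from typing import List, Sequence

def post_balance(samples: Sequence[dict], num_dp_ranks: int) -> List[List[dict]]:
    """Greedily redistribute a global batch across DP ranks so that the
    total sequence length (a proxy for compute) is roughly equal per rank.
    Illustrative longest-processing-time heuristic, not the paper's code."""
    ranks = [[] for _ in range(num_dp_ranks)]
    loads = [0] * num_dp_ranks
    # Longest samples first: classic LPT scheduling keeps bins balanced.
    for sample in sorted(samples, key=lambda s: s["seq_len"], reverse=True):
        target = loads.index(min(loads))  # least-loaded rank so far
        ranks[target].append(sample)
        loads[target] += sample["seq_len"]
    return ranks

# Example: 8 samples with wildly different modality compositions.
batch = [{"id": i, "seq_len": n}
         for i, n in enumerate([4096, 256, 3072, 512, 128, 2048, 1024, 64])]
for rank, shard in enumerate(post_balance(batch, num_dp_ranks=2)):
    print(rank, sum(s["seq_len"] for s in shard), [s["id"] for s in shard])
```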
Related papers
- MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization [15.01214559812713]
MQuant is a post-training quantization framework designed to tackle the challenges of multimodal large language models (MLLMs)
On five mainstream MLLMs (including Qwen-VL, MiniCPM-V, CogVLM2), MQuant under W4A8 achieves near-floating-point accuracy (<1% degradation) while reducing inference latency by up to 30%.
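As a rough illustration of what W4A8 full static quantization means — 4-bit weights, 8-bit activations, with scales fixed offline from calibration data rather than computed at runtime — consider this minimal fake-quant sketch (not MQuant's actual scheme, which adds MLLM-specific techniques):

```python
import torch

def static_quantize(t: torch.Tensor, num_bits: int, scale: float) -> torch.Tensor:
    """Symmetric static quantization: the scale is fixed ahead of time
    (e.g. from calibration data), so no runtime statistics are needed."""
    qmax = 2 ** (num_bits - 1) - 1
    q = torch.clamp(torch.round(t / scale), -qmax - 1, qmax)
    return q * scale  # dequantize for a fake-quant forward pass

# Calibrate scales once, offline, then reuse them at inference time.
w = torch.randn(256, 256)
x = torch.randn(8, 256)
w_scale = w.abs().max().item() / (2 ** 3 - 1)   # 4-bit weight grid
x_scale = x.abs().max().item() / (2 ** 7 - 1)   # 8-bit activation grid
y = static_quantize(x, 8, x_scale) @ static_quantize(w, 4, w_scale).T
```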
arXiv Detail & Related papers (2025-02-01T13:08:02Z)
- LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from a large MLLM (l-MLLM) to a small MLLM (s-MLLM).
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM.
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
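MDist amounts to matching the teacher's and student's output distributions; a standard temperature-scaled KL distillation loss, shown here as a generic sketch rather than the paper's exact objective, looks like:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Temperature-scaled KL divergence between teacher (l-MLLM) and
    student (s-MLLM) next-token distributions."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitude is stable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t * t

student = torch.randn(4, 32000, requires_grad=True)  # (tokens, vocab)
teacher = torch.randn(4, 32000)
distill_loss(student, teacher).backward()
```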
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
- $\gamma$-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models [87.43596173378913]
We propose an innovative strategy for existing MLLMs called $\gamma$-MoD.
In $\gamma$-MoD, a novel metric, the rank of attention maps (ARank), is proposed to guide the deployment of MoDs in the MLLM.
Based on ARank, we propose two novel designs to maximize the computational sparsity of the MLLM.
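As a hedged illustration, ARank can be read as the effective rank of a layer's attention map, with low-rank layers becoming candidates for mixture-of-depth token skipping; the routing score and threshold below are placeholders, not the paper's design:

```python
import torch

def attention_rank(attn_map: torch.Tensor, tol: float = 1e-3) -> int:
    """Effective rank of an attention map (singular values above tol).
    Layers with low-rank maps attend to few tokens, making them
    candidates for mixture-of-depth (token skipping)."""
    s = torch.linalg.svdvals(attn_map)
    return int((s / s.max() > tol).sum())

def mod_forward(x: torch.Tensor, layer, keep_ratio: float = 0.5):
    """Route only the top-k tokens (by a stand-in score) through the layer;
    the rest pass through unchanged. Illustrative, not the paper's code."""
    scores = x.norm(dim=-1)                 # placeholder router score
    k = max(1, int(keep_ratio * x.size(0)))
    idx = scores.topk(k).indices
    out = x.clone()
    out[idx] = layer(x[idx])
    return out

low_rank_map = torch.randn(10, 2) @ torch.randn(2, 10)   # rank-2 by construction
print(attention_rank(low_rank_map))                      # -> 2
out = mod_forward(torch.randn(16, 64), torch.nn.Linear(64, 64))
```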
arXiv Detail & Related papers (2024-10-17T17:59:53Z)
- EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model [15.449472477182061]
Current approaches for vision and language interaction fall into two categories: self-attention-based and cross-attention-based methods.
We modify the original self-attention mechanism in the MLLM into a composite attention mechanism.
EE-MLLM significantly outperforms Flamingo with limited training data, and reduces the prefilling time to 79 ms on an H800 GPU.
We also present a training-free variant named EE-MLLM-F, which reduces the computation cost of the self-attention-based method without additional training.
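One plausible reading of the composite attention mechanism is an attention mask in which visual tokens skip self-attention among themselves while text tokens attend causally over the whole sequence, cutting the quadratic cost of long visual prefixes during prefilling. The mask construction below is a sketch under that assumption, not EE-MLLM's actual implementation:

```python
import torch

def composite_attention_mask(num_visual: int, num_text: int) -> torch.Tensor:
    """Boolean attention mask (True = blocked), visual tokens first.
    Visual tokens attend only to themselves; text tokens attend
    causally over the full visual+text sequence."""
    n = num_visual + num_text
    mask = torch.ones(n, n, dtype=torch.bool).triu(1)  # causal: block future
    mask[:num_visual, :num_visual] = True              # drop visual-visual attention
    idx = torch.arange(num_visual)
    mask[idx, idx] = False                             # each visual token keeps itself
    return mask

print(composite_attention_mask(num_visual=4, num_text=3).int())
```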
arXiv Detail & Related papers (2024-08-21T17:36:37Z)
- CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models [68.64605538559312]
In this paper, we analyze the MLLM instruction tuning from both theoretical and empirical perspectives.
Inspired by our findings, we propose a measurement to quantitatively evaluate the learning balance.
In addition, we introduce an auxiliary loss regularization method to promote updating of the generation distribution of MLLMs.
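The paper's exact auxiliary term is not given in this summary; as a stand-in, the sketch below adds a generic entropy-based regularizer to the task loss so the generation distribution keeps receiving gradient signal during multimodal instruction tuning:

```python
import torch
import torch.nn.functional as F

def regularized_loss(logits, targets, aux_weight: float = 0.1):
    """Task loss plus a generic auxiliary regularizer (a placeholder for
    the paper's term): penalizing over-confident next-token distributions
    keeps the generation distribution updating during tuning."""
    task = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
    return task - aux_weight * entropy  # higher entropy is rewarded

logits = torch.randn(16, 32000, requires_grad=True)
targets = torch.randint(0, 32000, (16,))
regularized_loss(logits, targets).backward()
```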
arXiv Detail & Related papers (2024-07-29T23:18:55Z)
- Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) for diverse applications is crucial to meet complex demands.
Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs.
In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
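Mixed-precision delta compression can be sketched as: SVD the fine-tuned-minus-base weight delta, keep the leading singular directions at higher precision, and round the tail onto a coarse grid. The rank split and int4-style rounding below are illustrative choices, not Delta-CoMe's published recipe:

```python
import torch

def compress_delta(delta: torch.Tensor, rank_hi: int = 8, rank_lo: int = 64):
    """Decompose the weight delta by SVD; keep the head at fp16 and
    simulate low-bit storage for the tail via aggressive rounding."""
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    head = (U[:, :rank_hi] * S[:rank_hi]) @ Vh[:rank_hi]               # high precision
    tail = (U[:, rank_hi:rank_lo] * S[rank_hi:rank_lo]) @ Vh[rank_hi:rank_lo]
    scale = tail.abs().max() / 7                                       # ~int4 grid
    tail_q = torch.round(tail / scale) * scale
    return head.half(), tail_q.half()

delta = torch.randn(512, 512) @ torch.randn(512, 512) * 1e-3  # stand-in delta
head, tail = compress_delta(delta)
err = (delta - head.float() - tail.float()).norm() / delta.norm()
print(f"relative reconstruction error: {err:.3f}")
```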
arXiv Detail & Related papers (2024-06-13T07:57:27Z)
- ModaVerse: Efficiently Transforming Modalities with LLMs [25.49713745405194]
We introduce ModaVerse, a Multi-modal Large Language Model capable of comprehending and transforming content across various modalities.
We propose a novel Input/Output (I/O) alignment mechanism that operates directly at the level of natural language.
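Operating at the level of natural language suggests the LLM emits an instruction-like meta-response that is parsed and routed to an off-the-shelf generator. The tag format in this sketch is invented for illustration; ModaVerse's actual I/O protocol may differ:

```python
import re

def dispatch_meta_response(meta: str) -> str:
    """Parse a natural-language meta-response and route it to a downstream
    generator. The [modality] tag convention here is hypothetical."""
    m = re.match(r"\[(image|audio|video)\]\s*(.+)", meta)
    if m is None:
        return meta  # plain text: answer directly
    modality, prompt = m.groups()
    return f"call {modality} generator with prompt: {prompt!r}"

print(dispatch_meta_response("[image] a watercolor fox in the snow"))
```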
arXiv Detail & Related papers (2024-01-12T06:28:54Z)
- Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE [83.00018517368973]
Large Language Models (LLMs) can extend their zero-shot capabilities to multimodal learning through instruction tuning.
However, negative conflicts and interference among tasks may degrade performance.
We combine the well-known Mixture-of-Experts (MoE) technique with LoRA, a representative PEFT method, to design a novel LLM-based decoder, called LoRA-MoE, for multimodal learning.
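A generic LoRA-MoE layer — a frozen base projection plus several low-rank adapter experts blended by a per-token gate — can be sketched as follows (a minimal reading of the idea, not Octavius's code):

```python
import torch
import torch.nn as nn

class LoRAMoE(nn.Module):
    """Frozen base linear layer plus a mixture of LoRA experts."""
    def __init__(self, dim: int, rank: int = 8, num_experts: int = 4):
        super().__init__()
        self.base = nn.Linear(dim, dim, bias=False)
        self.base.weight.requires_grad_(False)  # frozen pretrained weight
        self.down = nn.ModuleList(nn.Linear(dim, rank, bias=False) for _ in range(num_experts))
        self.up = nn.ModuleList(nn.Linear(rank, dim, bias=False) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)  # per-token expert router

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)              # (..., E)
        expert_out = torch.stack(
            [up(down(x)) for down, up in zip(self.down, self.up)], dim=-1
        )                                                          # (..., dim, E)
        return self.base(x) + (expert_out * weights.unsqueeze(-2)).sum(-1)

y = LoRAMoE(dim=64)(torch.randn(2, 10, 64))  # (batch, tokens, dim)
```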
arXiv Detail & Related papers (2023-11-05T15:48:29Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
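Structural pruning of coupled structures means that removing an output channel of one projection forces removing the matching input channel of the next, so both are dropped together. The sketch below prunes one such coupled pair using a simple weight-magnitude importance score, where LLM-Pruner itself uses gradient-based importance:

```python
import torch
import torch.nn as nn

def prune_coupled_rows(linear_out: nn.Linear, linear_in: nn.Linear, keep: int):
    """Prune one coupled pair: row i of the first projection and
    column i of the next are removed together."""
    score = linear_out.weight.abs().sum(dim=1)     # one score per channel
    idx = score.topk(keep).indices.sort().values   # channels to keep, in order
    out = nn.Linear(linear_out.in_features, keep, bias=False)
    inn = nn.Linear(keep, linear_in.out_features, bias=False)
    out.weight = nn.Parameter(linear_out.weight[idx].clone())
    inn.weight = nn.Parameter(linear_in.weight[:, idx].clone())
    return out, inn

# Shrink an MLP block's hidden width from 2048 to 1024 channels.
up, down = nn.Linear(512, 2048, bias=False), nn.Linear(2048, 512, bias=False)
up_p, down_p = prune_coupled_rows(up, down, keep=1024)
y = down_p(torch.relu(up_p(torch.randn(4, 512))))
```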
arXiv Detail & Related papers (2023-05-19T12:10:53Z)