VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
- URL: http://arxiv.org/abs/2508.02317v3
- Date: Thu, 07 Aug 2025 10:31:09 GMT
- Title: VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
- Authors: Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, Zhi Zhang, Xin Liu,
- Abstract summary: VeOmni is a training framework for large language models (LLMs). VeOmni introduces model-centric distributed recipes that decouple communication from computation. Using VeOmni, models can be trained with over 2,800 tokens/sec/GPU throughput and scaled to 160K context lengths via 3D parallelism on 128 GPUs.
- Score: 25.89459841661218
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in large language models (LLMs) have driven impressive progress in omni-modal understanding and generation. However, training omni-modal LLMs remains a significant challenge due to the heterogeneous model architectures required to process diverse modalities, necessitating sophisticated system design for efficient large-scale training. Existing frameworks typically entangle model definition with parallel logic, resulting in limited scalability and substantial engineering overhead for end-to-end omni-modal training. We present VeOmni, a modular and efficient training framework to accelerate the development of omni-modal LLMs. VeOmni introduces model-centric distributed recipes that decouple communication from computation, enabling efficient 3D parallelism on omni-modal LLMs. VeOmni also features a flexible configuration interface supporting seamless integration of new modalities with minimal code change. Using VeOmni, an omni-modal mixture-of-experts (MoE) model with 30B parameters can be trained with over 2,800 tokens/sec/GPU throughput and scaled to 160K context lengths via 3D parallelism on 128 GPUs, showcasing its superior efficiency and scalability for training large omni-modal LLMs.
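To make the "model-centric distributed recipe" idea concrete, the sketch below illustrates the general pattern of keeping parallelism out of the model definition and expressing it as a separate configuration. This is an illustrative sketch only, not VeOmni's actual API: `OmniModalLM`, `apply_recipe`, and the recipe keys are hypothetical names, written in plain PyTorch.

```python
# Illustrative sketch only: NOT VeOmni's real API. It shows the general idea of a
# "model-centric distributed recipe": the model is written as plain modules with no
# parallel logic, and a separate recipe maps each component to a parallelism strategy.
# All names (OmniModalLM, apply_recipe, strategy strings) are hypothetical.
import torch.nn as nn


class OmniModalLM(nn.Module):
    """Model definition contains no communication or parallelism code."""

    def __init__(self, d_model=1024, vocab=32000):
        super().__init__()
        self.vision_encoder = nn.Sequential(nn.Linear(1024, d_model), nn.GELU())
        self.audio_encoder = nn.Sequential(nn.Linear(512, d_model), nn.GELU())
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=d_model, nhead=16, batch_first=True),
            num_layers=4,
        )
        self.lm_head = nn.Linear(d_model, vocab)


# The "recipe" lives outside the model: it names submodules and assigns each one a
# sharding/parallel strategy, so changing the strategy requires no model changes.
recipe = {
    "vision_encoder": {"strategy": "fsdp"},              # fully sharded data parallel
    "audio_encoder": {"strategy": "fsdp"},
    "decoder": {"strategy": "fsdp+sp"},                   # add sequence parallel for long context
    "lm_head": {"strategy": "tp", "tp_size": 8},          # tensor parallel
}


def apply_recipe(model: nn.Module, recipe: dict) -> nn.Module:
    """Hypothetical helper: a real framework would wrap each named submodule with the
    communication hooks for its strategy; here we only look the submodules up."""
    for name, plan in recipe.items():
        submodule = model.get_submodule(name)  # raises if the recipe names a missing module
        print(f"would wrap {name} ({type(submodule).__name__}) with {plan['strategy']}")
    return model


model = apply_recipe(OmniModalLM(), recipe)
```

Because the recipe is just data, swapping strategies on a module or adding a new modality encoder touches only the configuration, which mirrors the decoupling of model definition from parallel logic described in the abstract.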
Related papers
- Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging [103.98582374569789]
Model merging aims to combine multiple expert models into a single model, thereby reducing storage and serving costs. Previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. We introduce the model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, providing both LoRA and full fine-tuning models.
arXiv Detail & Related papers (2025-05-26T12:23:14Z)
- Ola: Pushing the Frontiers of Omni-Modal Language Model [88.72389428177942]
We present Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding. Ola incorporates advanced visual understanding and audio recognition capabilities through several critical and effective improvements. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field.
arXiv Detail & Related papers (2025-02-06T18:59:55Z)
- LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
- OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. Our evaluation reveals that open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- MoExtend: Tuning New Experts for Modality and Task Extension [61.29100693866109]
MoExtend is an effective framework designed to streamline the modality adaptation and extension of Mixture-of-Experts (MoE) models.
MoExtend seamlessly integrates new experts into pre-trained MoE models, endowing them with novel knowledge without the need to tune pretrained models.
arXiv Detail & Related papers (2024-08-07T02:28:37Z)
- Multimodal Lego: Model Merging and Fine-Tuning Across Topologies and Modalities in Biomedicine [10.774128925670183]
Multimodal Lego (MM-Lego) is a general-purpose fusion framework to turn any set of encoders into a competitive multimodal model with no or minimal fine-tuning. We show that MM-Lego can be used as a model merging method which achieves competitive performance with end-to-end fusion models without any fine-tuning.
arXiv Detail & Related papers (2024-05-30T11:14:01Z)