Fugu-MT Paper Translation (Summary): MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models

Paper summary: MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models

arxiv url: http://arxiv.org/abs/2510.17519v2
Date: Wed, 22 Oct 2025 10:01:01 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:12.056883
Title: MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
Title (reference translation): MUG-V 10B: A High-Efficiency Training Pipeline for Large-Scale Video Generation Models
Authors: Yongshun Zhang, Zhongyi Fan, Yonghang Zhang, Zhangzikang Li, Weifeng Chen, Zhongwei Feng, Chaoyue Wang, Peng Hou, Anxiang Zeng
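Abstract summary: Training large-scale video generation models remains difficult and resource-intensive. We propose a training framework that optimizes four pillars: data processing, model architecture, training strategy, and infrastructure. We open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement.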
Reference score (independently computed attention score): 23.09416541835573
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: In recent years, large-scale generative models for visual content (e.g., images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations delivered significant efficiency gains and performance improvements across all stages of data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that exploits Megatron-Core to achieve high training efficiency and near-linear multi-node scaling. Details are available at https://github.com/Shopee-MUG/MUG-V.
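Abstract (reference translation): In recent years, large-scale generative models for images, videos, and 3D objects and scenes have advanced markedly. Training large-scale video generation models, however, remains especially difficult and resource-intensive because of cross-modal text-video alignment, the long sequences involved, and complex spatiotemporal dependencies. To address these challenges, we propose a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations yielded substantial efficiency and performance gains at every stage: data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. The resulting model, MUG-V 10B, is on par overall with recent state-of-the-art video generators and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that uses Megatron-Core to achieve high training efficiency and near-linear multi-node scaling. Details are available at https://github.com/Shopee-MUG/MUG-V.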


Related papers