Fugu-MT Paper Translation (Summary): Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

Paper summary: Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

arxiv url: http://arxiv.org/abs/2605.02641v1
Date: Mon, 04 May 2026 14:26:33 GMT
Status: Translation complete
System update date: 2026-05-05 20:33:50.333937
Title: Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
Title (reference translation): Mamoda2.5: Realizing a Unified Multimodal Model with DiT-MoE
Authors: Yangming Shi, Shixiang Zhu, Tao Shen, Zhimiao Yu, Dengsheng Chen, Taicai Chen, Yunfei Yang, Juan Zhou, Chen Cheng, Liang Ma, Xibin Wu, Benxuan Yan, Ge Li, Tuoyu Zhang, Dan Li, Chang Liu, Zhenbang Sun
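Abstract summary: Mamoda2.5 is a unified AR-Diffusion framework that seamlessly integrates multimodal understanding and generation within a single architecture. Mamoda2.5 achieves top-tier generation performance on VBench 2.0 and sets a new record in video editing quality.
Reference score (independently computed attention score): 28.80090439127626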
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present Mamoda2.5, a unified AR-Diffusion framework that seamlessly integrates multimodal understanding and generation within a single architecture. To efficiently enhance the model's generation capability, we equip the Diffusion Transformer backbone with a fine-grained Mixture-of-Experts (MoE) design (128 experts, Top-8 routing), yielding a 25B-parameter model that activates only 3B parameters, significantly reducing training costs while scaling up the model capacity. Mamoda2.5 achieves top-tier generation performance on VBench 2.0 and sets a new record in video editing quality, surpassing evaluated open-source models and matching the performance of current top-tier proprietary models, including Kling O1, on OpenVE-Bench. Furthermore, we introduce a joint few-step distillation and reinforcement learning framework that compresses the 30-step editing model into a 4-step model and greatly accelerates model inference. Compared to open-source baselines, Mamoda2.5 achieves up to $95.9\times$ faster video editing inference. In real-world applications, Mamoda2.5 has been successfully deployed for content moderation and creative restoration tasks in advertising scenarios, achieving a 98% success rate in an internal advertising video editing scenario.
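Abstract (reference translation): Mamoda2.5 is a unified AR-Diffusion framework that seamlessly integrates multimodal understanding and generation within a single architecture. To improve the model's generation capability efficiently, the Diffusion Transformer backbone is equipped with a fine-grained Mixture-of-Experts (MoE) design (128 experts, Top-8 routing), producing a 25B-parameter model that activates only 3B parameters, which significantly reduces training cost while scaling up model capacity. Mamoda2.5 achieves top-tier generation performance on VBench 2.0 and sets a new record in video editing quality, surpassing the evaluated open-source models and matching current top-tier proprietary models, including Kling O1, on OpenVE-Bench. In addition, a joint few-step distillation and reinforcement learning framework compresses the 30-step editing model into a 4-step model and greatly accelerates inference. Compared with open-source baselines, Mamoda2.5 achieves up to $95.9\times$ faster video editing inference. In real-world use, Mamoda2.5 has been deployed for content moderation and creative restoration tasks in advertising scenarios, achieving a 98% success rate in an internal advertising video editing scenario.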
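
The "128 experts, Top-8 routing" design mentioned in the abstract can be illustrated with a minimal sketch of per-token top-k expert selection. Only the two constants come from the abstract; the function name `top_k_route`, the random stand-in router logits, and the choice to renormalize with a softmax over the selected experts are assumptions made for illustration, not the paper's implementation.

```python
import math
import random

NUM_EXPERTS = 128  # from the abstract
TOP_K = 8          # from the abstract


def top_k_route(logits, k=TOP_K):
    """Return (expert_indices, weights) for a single token.

    Hypothetical helper: keeps the k highest-scoring experts and
    renormalizes their scores with a softmax over the selected
    logits only (an assumed convention, not confirmed by the source).
    """
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]


# Stand-in router output for one token; a real router would compute
# these scores from the token's hidden state.
rng = random.Random(0)
token_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
experts, weights = top_k_route(token_logits)

print(len(experts), "of", NUM_EXPERTS, "experts selected for this token")
print("routing weights sum to", round(sum(weights), 6))
```

Under this scheme each token touches 8 of 128 experts (6.25%). That ratio need not match the 3B-of-25B active-parameter figure, since the abstract does not say how parameters are split between expert and non-expert layers.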

List of related papers