Fugu-MT Paper Translation (Abstract): minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

Paper abstract: minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

arxiv url: http://arxiv.org/abs/2605.30263v1
Date: Thu, 28 May 2026 17:27:03 GMT
Status: Translation complete
System update date: 2026-05-30 02:45:56.584322
Title: minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
Title (reference translation): minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
Authors: Min Zhao, Hongzhou Zhu, Bokai Yan, Zihan Zhou, Yimin Chen, Wenqiang Sun, Kaiwen Zheng, Guande He, Xiao Yang, Chongxuan Li, Fan Bao, Jun Zhu
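Abstract summary: minWM is a full-stack open-source framework for building real-time interactive video world models. It converts existing bidirectional T2V/TI2V video foundation models into camera-controllable few-step autoregressive world models.
Reference score (independently computed attention score): 51.45338589543413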
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable, causal, and low-latency rollout, which in practice demands a full pipeline spanning data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference. In this work, we present minWM, a full-stack open-source framework for building real-time interactive video world models. minWM provides an end-to-end pipeline that converts existing bidirectional T2V/TI2V video foundation models into camera-controllable few-step autoregressive world models. Specifically, minWM first fine-tunes a bidirectional video diffusion model with camera control, and then applies the Causal Forcing / Causal Forcing++ pipeline, including AR diffusion training, causal ODE or causal consistency distillation, and asymmetric DMD, to distill it into a few-step autoregressive generator for low-latency rollout. The framework is modular and architecture-extensible: we instantiate it on representative open backbones, including Wan2.1-T2V-1.3B and HY1.5-TI2V-8B, covering both cross-attention-based condition injection and MMDiT-style architectures. minWM also supports adapting existing video world models, such as HY-WorldPlay, to new data distributions, training recipes, and latency targets. Beyond releasing runnable scripts, checkpoints, documentation, and inference code, we provide practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements. We hope minWM serves as a reproducible and extensible recipe for building and adapting real-time interactive video world models. Project Page: [https://github.com/shengshu-ai/minWM](https://github.com/shengshu-ai/minWM)
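Abstract (reference translation): Recent video diffusion foundation models have made remarkable progress in high-quality video generation, but turning them into real-time interactive video world models remains difficult. Interactive world models require controllable, causal, low-latency rollout, which in practice calls for a complete pipeline spanning data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference. This work introduces minWM, a full-stack open-source framework for building real-time interactive video world models. minWM provides an end-to-end pipeline that converts existing bidirectional T2V/TI2V video foundation models into camera-controllable few-step autoregressive world models. Specifically, it first fine-tunes a bidirectional video diffusion model with camera control, then applies the Causal Forcing / Causal Forcing++ pipeline, which includes AR diffusion training, causal ODE or causal consistency distillation, and asymmetric DMD, to distill the model into a few-step autoregressive generator for low-latency rollout. The framework is modular and architecture-extensible: it is instantiated on representative open backbones, including Wan2.1-T2V-1.3B and HY1.5-TI2V-8B, covering both cross-attention-based condition injection and MMDiT-style architectures. minWM also supports adapting existing video world models, such as HY-WorldPlay, to new data distributions, training recipes, and latency targets. In addition to runnable scripts, checkpoints, documentation, and inference code, the authors provide practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements. They hope minWM will serve as a reproducible and extensible recipe for building and adapting real-time interactive video world models. Project Page: [https://github.com/shengshu-ai/minWM](https://github.com/shengshu-ai/minWM)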
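The abstract describes converting a bidirectional video model into an autoregressive one through an ordered sequence of stages. The sketch below is illustrative only and is not taken from the minWM code base: it records the stage order as stated in the abstract and shows one common way to express the bidirectional versus causal distinction, a frame-level block-causal attention mask. The names `PIPELINE_STAGES`, `bidirectional_mask`, and `block_causal_mask` are hypothetical, and the choice of a frame-level mask is an assumption; the actual framework may use a different granularity, such as chunks of several frames.

```python
from typing import List

# Stage order as listed in the abstract (the names are descriptive labels,
# not identifiers from the minWM repository).
PIPELINE_STAGES = (
    "camera-control fine-tuning of the bidirectional model",
    "AR diffusion training",
    "causal ODE or causal consistency distillation",
    "asymmetric DMD",
)


def bidirectional_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """Every token may attend to every other token (no causality)."""
    n = num_frames * tokens_per_frame
    return [[True] * n for _ in range(n)]


def block_causal_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k.

    Assumption: causality is applied per frame, so a token sees all tokens
    in its own frame and in earlier frames, but none in later frames.
    """
    n = num_frames * tokens_per_frame
    mask = []
    for q in range(n):
        q_frame = q // tokens_per_frame
        mask.append([(k // tokens_per_frame) <= q_frame for k in range(n)])
    return mask


if __name__ == "__main__":
    # Print the mask for 3 frames of 2 tokens each; 1 = visible, 0 = masked.
    for row in block_causal_mask(num_frames=3, tokens_per_frame=2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under this assumption, a frame can be generated without access to future frames, which is the property that streaming rollout relies on.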


Related papers