Fugu-MT Paper Translation (Abstract): BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression

Paper summary: BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression

arxiv url: http://arxiv.org/abs/2606.10135v2
Date: Wed, 10 Jun 2026 12:21:54 GMT
Status: Translation complete
Last updated in system: 2026-06-11 14:23:44.377972
Title: BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression
Title (reference translation): BiWM: Improving Open-Source Interactive Video World Models with Bidirectional Autoregression
Authors: Shaohao Rui, Xiaofeng Mao, Zhanyu Zhang, Peijia Lin, Yansong Zhu, Yibo Zhang, Haibin Wan, Weijie Ma,
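Abstract summary: We introduce BiWM, the first full-stack framework for interactive video world models under the bidirectional autoregressive paradigm. Starting from a pretrained video backbone, BiWM injects camera control by fine-tuning, then runs a few-step Distribution Matching Distillation stage. A single recipe spans Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B.
Reference score (independently computed attention score): 13.648620674803079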
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transitioning bidirectional video diffusion models into an autoregressive paradigm improves the interactivity of video world models, but existing causal pipelines need many stages (control fine-tuning, autoregressive training, causal initialization, few-step distillation) and still trail bidirectional models in quality due to error accumulation. Recent world models such as Yume-1.5 and Matrix-Game-3.0 instead adopt a bidirectional autoregressive approach, gaining fidelity and stable long-horizon rollout from self-correcting error propagation, yet open-source frameworks (e.g., minWM) support only causal models. We present BiWM, the first full-stack framework for interactive video world models under the bidirectional autoregressive paradigm, jointly optimizing generation quality and inference speed. From a pretrained video backbone, BiWM injects camera control by fine-tuning, then runs a few-step Distribution Matching Distillation (DMD) stage that turns the backbone into an action/camera-controllable world model: just two training stages instead of four in minWM, converging in a few hundred steps on 8xH200 GPUs. A single recipe spans Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B, and also supports secondary fine-tuning of existing bidirectional models. BiWM enables real-world camera control where minWM loses controllability, integrates pluggable history compression (FramePack-style and PackForcing-style) for long rollouts, and offers an optional NVFP4 4-bit training/inference pipeline. To counter DMD's mode-seeking degradation, we add GAN and mass-covering forward-KL objectives that preserve scene dynamics. We open-source BiWM for resource-constrained research and high-fidelity environment simulation.
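Abstract (reference translation): Moving bidirectional video diffusion models to an autoregressive paradigm improves the interactivity of video world models, but existing causal pipelines require many stages (control fine-tuning, autoregressive training, causal initialization, few-step distillation) and still trail bidirectional models in quality because of error accumulation. Recent world models such as Yume-1.5 and Matrix-Game-3.0 instead adopt a bidirectional autoregressive approach, gaining fidelity and stable long-horizon rollout from self-correcting error propagation, yet open-source frameworks (e.g., minWM) support only causal models. We present BiWM, the first full-stack framework for interactive video world models under the bidirectional autoregressive paradigm, jointly optimizing generation quality and inference speed. Starting from a pretrained video backbone, BiWM injects camera control by fine-tuning, then runs a few-step Distribution Matching Distillation (DMD) stage that turns the backbone into an action/camera-controllable world model. A single recipe spans Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B. BiWM enables real-world camera control where minWM loses controllability, integrates pluggable history compression (FramePack-style and PackForcing-style) for long rollouts, and offers an optional NVFP4 4-bit training/inference pipeline. To counter DMD's mode-seeking degradation, GAN and mass-covering forward-KL objectives are added to preserve scene dynamics. BiWM is released as open source for resource-constrained research and high-fidelity environment simulation.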
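
Note on terminology: "mode-seeking" and "mass-covering" conventionally name the two directions of the KL divergence. The sketch below uses generic notation assumed here for illustration only (p_data for the target or teacher distribution, p_theta for the few-step generator); it is not taken from the paper, and the abstract does not state BiWM's exact loss or how the GAN and forward-KL terms are weighted.

```latex
% Reverse KL, the direction DMD-style distillation minimizes: mode-seeking.
% The expectation is over generator samples, so modes of p_data that the
% generator never produces contribute no penalty.
D_{\mathrm{KL}}\left(p_\theta \,\|\, p_{\mathrm{data}}\right)
  = \mathbb{E}_{x \sim p_\theta}\left[\log \frac{p_\theta(x)}{p_{\mathrm{data}}(x)}\right]

% Forward KL: mass-covering.
% The expectation is over target samples, so assigning low probability to any
% region where p_data has mass is penalized heavily.
D_{\mathrm{KL}}\left(p_{\mathrm{data}} \,\|\, p_\theta\right)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log \frac{p_{\mathrm{data}}(x)}{p_\theta(x)}\right]
```

Read this way, adding a forward-KL term to a reverse-KL distillation objective plausibly discourages the generator from collapsing onto a few dominant modes, which matches the abstract's stated goal of preserving scene dynamics.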

Related papers