Fugu-MT Paper Translation (Abstract): WALL-WM: Carving World Action Modeling at the Event Joints

Paper overview: WALL-WM: Carving World Action Modeling at the Event Joints

arxiv url: http://arxiv.org/abs/2606.01955v1
Date: Mon, 01 Jun 2026 09:14:51 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:31.691407
Title: WALL-WM: Carving World Action Modeling at the Event Joints
Title (reference translation): WALL-WM: World Action Modeling at the Event Joints
Authors: Shalfun Li, Victor Yao, Charles Yang, Truth Qu, Regis Cheng, Ryan Yu, Howard Lu, Newton Von, Vincent Chen, Yohann Tang, Maeve Zhang, Ellie Ma, Gody Li, Sage Yang, Lorien Shu, J. W. Gao, Ethan Chen, Colin Ye, Yu Sun, Elise Mon, PS Zhang, Neo Li, Lily Li, James Wang, Ping Yang, Chris Pan, Lucy Liang, Hang Su, Roy Gan, Hao Wang, Qian Wang
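Abstract summary: WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded VLA pretraining. It addresses the granularity mismatch among language, vision, and action by organizing both supervision and data around semantic events. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.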
Reference score (independently computed attention score): 14.768586112050684
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.
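
The contrast between fixed-length chunks and event-level units described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the function names, the `(caption, start, end)` event format, and the example captions are all hypothetical, and the actions are plain integers standing in for control steps.

```python
from typing import List, Sequence, Tuple

# Hypothetical event annotation: (caption, start index, end index), end exclusive.
Event = Tuple[str, int, int]


def fixed_length_chunks(actions: Sequence[int], chunk_len: int) -> List[List[int]]:
    """Split a trajectory into fixed-length windows (the last one may be shorter)."""
    return [list(actions[i:i + chunk_len]) for i in range(0, len(actions), chunk_len)]


def event_chunks(actions: Sequence[int], events: Sequence[Event]) -> List[Tuple[str, List[int]]]:
    """Split a trajectory along annotated event boundaries, keeping each caption."""
    return [(caption, list(actions[start:end])) for caption, start, end in events]


# Toy trajectory of 10 control steps with three assumed semantic events.
actions = list(range(10))
events = [
    ("reach for the cup", 0, 3),
    ("grasp the cup", 3, 5),
    ("place the cup on the shelf", 5, 10),
]

fixed = fixed_length_chunks(actions, 4)
by_event = event_chunks(actions, events)

print(fixed)
print(by_event)
```

In this toy case the first fixed-length window covers steps 0 to 3 and therefore straddles the boundary between the "reach" and "grasp" events, whereas the event-level split yields variable-length chunks that each pair with exactly one caption.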

