Fugu-MT Paper Translation (Abstract): ActWorld: From Explorable to Interactive World Model via Action-Aware Memory

Paper overview: ActWorld: From Explorable to Interactive World Model via Action-Aware Memory

arxiv url: http://arxiv.org/abs/2606.17730v1
Date: Tue, 16 Jun 2026 09:47:32 GMT
Status: Translation complete
System update date: 2026-06-17 17:15:32.381765
Title: ActWorld: From Explorable to Interactive World Model via Action-Aware Memory
Title (reference translation): ActWorld: From an Explorable World Model to an Interactive World Model via Action-Aware Memory
Authors: Zhexiao Xiong, Yizhi Song, Hao Kang, Qing Yan, Liming Jiang, Jenson Yang, Zhoujie Fu, Stathi Fotiadis, Angtian Wang, Zichuan Liu, Bo Liu, Yiding Yang, Xin Lu, Nathan Jacobs
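Abstract summary: This paper introduces ActWorld, an interactive world model. Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model.
Reference score (independently computed attention score): 36.88820961480639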
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Interactive world models aim to simulate environment dynamics under real-time user actions. However, their action vocabulary is largely confined to navigation: most actions correspond to motion (e.g., walk, turn, look around), while interaction with objects in the scene (e.g., pick up plates, open doors, or trigger physical responses) is either absent, restricted to game domains, or relegated to prompt-to-full-video scenarios. The resulting worlds are visually explorable but not truly actionable. In this work, we present ActWorld, an interactive world model that extends prior navigation-centric generators to support mid-rollout object interaction within a chunk-autoregressive framework. We argue that the navigation-interaction gap stems from two bottlenecks. First, a data bottleneck: the lack of human-object interaction data with accurate, dense labels. Second, a memory bottleneck: recency-biased history compression in existing world models discards the event-transition frames that causally determine subsequent object states, leading to an action-forgetting pathology. On the data side, we construct a 100K interaction video dataset, each annotated with per-chunk captions via chain-of-thought reasoning. On the model side, we introduce a hierarchical action-aware memory design that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event-update and object-identity tokens across long rollouts. Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model, substantially improving interaction fidelity over navigation-only baselines without sacrificing viewpoint control. Project page is available at https://interactwm.github.io/ActWorld.
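Abstract (reference translation): Interactive world models aim to simulate environment dynamics under real-time user actions. However, their action vocabulary is largely limited to navigation: most actions correspond to motion (e.g., walking, turning, looking around), while interaction with objects in the scene (e.g., picking up plates, opening doors, or triggering physical responses) is absent, restricted to game domains, or relegated to prompt-to-full-video scenarios. The resulting worlds are visually explorable but not truly actionable. In this paper, we present ActWorld, an interactive world model that extends earlier navigation-centric generators to support mid-rollout object interaction within a chunk-autoregressive framework. We argue that the gap between navigation and interaction stems from two bottlenecks. The first is a data bottleneck: human-object interaction data with accurate, dense labels is lacking. The second is a memory bottleneck: recency-biased history compression in existing world models discards the event-transition frames that causally determine subsequent object states, which leads to an action-forgetting pathology. On the data side, we construct a dataset of 100K interaction videos, each annotated with per-chunk captions produced via chain-of-thought reasoning. On the model side, we introduce a hierarchical action-aware memory design that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event-update and object-identity tokens across long rollouts. Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model, and substantially improves interaction fidelity over navigation-only baselines without sacrificing viewpoint control. The project page is available at https://interactwm.github.io/ActWorld.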
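
The memory bottleneck described in the abstract can be illustrated with a toy example. The abstract gives no implementation details, so what follows is only a minimal sketch under stated assumptions, not the authors' method: the `Chunk` type, the scalar `importance` score, and both compression functions are hypothetical names introduced here, and the persistent memory bank is not modelled. The sketch contrasts keeping only the newest chunks with reserving part of the budget for older chunks that carry a high interaction-importance score.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int         # position of the chunk in the rollout
    caption: str       # per-chunk caption
    importance: float  # hypothetical interaction-importance score in [0, 1]


def recency_compress(history, budget):
    """Recency-biased baseline: keep only the newest `budget` chunks."""
    return history[max(len(history) - budget, 0):]


def action_aware_compress(history, budget, recent=2):
    """Toy importance routing (an assumption, not the paper's design):
    keep the `recent` newest chunks, then fill the remaining budget with
    the older chunks that have the highest importance scores."""
    recent = min(recent, budget)
    split = max(len(history) - recent, 0)
    older, recent_part = history[:split], history[split:]
    slots = budget - len(recent_part)
    kept_old = sorted(older, key=lambda c: c.importance, reverse=True)[:slots]
    # Restore temporal order so the kept context reads as a timeline.
    return sorted(kept_old + recent_part, key=lambda c: c.index)


# Illustrative rollout: one interaction event followed by navigation only.
history = [
    Chunk(0, "walk forward", 0.1),
    Chunk(1, "open the door", 0.9),
    Chunk(2, "turn left", 0.1),
    Chunk(3, "walk forward", 0.1),
    Chunk(4, "look around", 0.1),
    Chunk(5, "walk forward", 0.1),
]

print([c.index for c in recency_compress(history, 3)])       # [3, 4, 5]
print([c.index for c in action_aware_compress(history, 3)])  # [1, 4, 5]
```

With a budget of three chunks, the recency-only baseline drops chunk 1 ("open the door"), so nothing in the kept history records that the door is open. The importance-routed variant keeps that chunk alongside the two newest ones.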

Related papers