Fugu-MT Paper Translation (Abstract): X-Foresight: A Joint Vision-Action Causal Forecasting Network via Predictive World Modeling

Paper overview: X-Foresight: A Joint Vision-Action Causal Forecasting Network via Predictive World Modeling

arxiv url: http://arxiv.org/abs/2605.24892v1
Date: Sun, 24 May 2026 06:37:04 GMT
Status: Translation complete
System update date: 2026-05-26 19:50:18.494933
Title: X-Foresight: A Joint Vision-Action Causal Forecasting Network via Predictive World Modeling
Authors: Baolu Li, Jingyu Qian, Rui Guo, Yilun Chen, Hanpeng Liu, Yuan Lin, Junhong Zhou, Ruixin Liu, Willow Yang, Yutong Zheng, Zhenli Zhang, Tenglong Gu, Zhuangzhuang Ding, Pengkun Zheng, Yu Zhang, Xianming Liu
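Abstract summary: We introduce X-Foresight, a predictive world model integrated directly into the Vision-Language-Action architecture. By predicting semantically distant chunks rather than adjacent frames, X-Foresight escapes trivial extrapolation. Comprehensive experiments show that X-Foresight significantly outperforms VLA baselines in planning performance.
Reference score (independently computed attention score): 47.54820149491433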
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Physical world knowledge resides mainly in videos. Equipping Vision-Language-Action (VLA) models with such knowledge is fundamental for safe and generalizable planning. Predictive world modeling enables a VLA to internalize physical dynamics and long-term causality by predicting future video from past observations. However, naive next-frame prediction faces two challenges: 1) unlike semantically distinct text tokens, video tokens are low-entropy and redundant, causing prediction to degenerate into trivial extrapolation; 2) world modeling poses a temporal dilemma: dense prediction captures instantaneous dynamics but cannot efficiently model long-horizon causality. To learn world knowledge effectively, we introduce X-Foresight, a predictive world model integrated directly into the VLA architecture to jointly learn world modeling and real-time action control. At its core lies a long-horizon chunk-wise auto-regressive strategy that addresses both challenges: by predicting semantically distant chunks rather than adjacent frames, it escapes trivial extrapolation, while preserving dense intra-chunk frames for instantaneous dynamics and sparse inter-chunk transitions for long-term causality. A curriculum learning schedule progressively extends prediction horizons and stabilizes long-horizon training. To capture long-term causality effectively, we present temporal importance sampling, which concentrates supervision on safety-critical chunks identified by ego-motion and behavioral signals. We further delegate photorealistic synthesis to a diffusion-based multi-view renderer, improving visual appearance. Comprehensive experiments demonstrate that X-Foresight significantly outperforms VLA baselines in planning performance while maintaining strong generative fidelity, establishing a robust paradigm for world-knowledge-driven autonomous systems.
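Illustrative sketch (not from the paper): the abstract names three training ingredients, namely chunk-wise prediction with dense intra-chunk frames and sparse inter-chunk transitions, a curriculum that extends the prediction horizon, and temporal importance sampling over safety-critical chunks. The Python below is one plausible way to express those ideas under stated assumptions. Every function name, the fixed `stride`, the linear schedule, and the softmax weighting are hypothetical choices made for illustration; the abstract does not specify any of them, and the per-chunk `scores` stand in for whatever ego-motion and behavioral signals the authors actually use.

```python
import math
import random


def chunk_frame_indices(num_frames, chunk_len, stride):
    """Dense frames inside each chunk, gaps of (stride - chunk_len) between chunks.

    Assumption: chunks are laid out at a fixed stride; the paper may differ.
    """
    if stride < chunk_len:
        raise ValueError("stride must be >= chunk_len so chunks do not overlap")
    chunks = []
    start = 0
    while start + chunk_len <= num_frames:
        chunks.append(list(range(start, start + chunk_len)))
        start += stride
    return chunks


def curriculum_horizon(step, total_steps, min_chunks=1, max_chunks=4):
    """Number of future chunks to predict at a training step.

    Assumption: a linear ramp from min_chunks to max_chunks.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_chunks + round(progress * (max_chunks - min_chunks))


def importance_weights(scores, temperature=1.0):
    """Softmax over hypothetical per-chunk criticality scores."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_supervised_chunks(scores, k, rng):
    """Draw k chunk indices (with replacement) in proportion to importance."""
    weights = importance_weights(scores)
    return rng.choices(range(len(scores)), weights=weights, k=k)


if __name__ == "__main__":
    chunks = chunk_frame_indices(num_frames=20, chunk_len=4, stride=8)
    print(chunks)  # three chunks of four consecutive frames each
    print(curriculum_horizon(step=0, total_steps=100))
    # Hypothetical criticality scores, e.g. derived from braking or steering.
    print(sample_supervised_chunks([0.1, 2.0, 0.3], k=5, rng=random.Random(0)))
```

In this sketch, raising `stride` relative to `chunk_len` makes consecutive chunks more temporally distant, and lowering `temperature` concentrates sampling more sharply on the highest-scoring chunks.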
