Fugu-MT Paper Translation (Abstract): Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception

Paper overview: Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception

arxiv url: http://arxiv.org/abs/2604.17651v1
Date: Sun, 19 Apr 2026 22:50:32 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.618854
Title: Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception
Title (reference translation): Infrastructure-centric world models: temporal depth and spatial breadth for roadside perception
Authors: Siyuan Meng, Chengbo Ai
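Abstract summary: This paper presents Infrastructure-centric World Models (I-WM) in three phases. It proposes a dual-layer architecture in which annotation-free perception serves as a multi-modal data engine feeding end-to-end generative world models. It establishes a taxonomy of driving world model paradigms and positions I-WM relative to LeCun's JEPA, Li Fei-Fei's spatial intelligence, and VLA architectures.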
Reference score (independently computed attention score): 3.3242611619309614
License: http://creativecommons.org/licenses/by/4.0/
Abstract: World models, generative AI systems that simulate how environments evolve, are transforming autonomous driving, yet all existing approaches adopt an ego-vehicle perspective, leaving the infrastructure viewpoint unexplored. We argue that infrastructure-centric world models offer a fundamentally complementary capability: the bird's-eye, multi-sensor, persistent viewpoint that roadside systems uniquely possess. Central to our thesis is a spatio-temporal complementarity: fixed roadside sensors excel at temporal depth, accumulating long-term behavioral distributions including rare safety-critical events, while vehicle-borne sensors excel at spatial breadth, sampling diverse scenes across large road networks. This paper presents a vision for Infrastructure-centric World Models (I-WM) in three phases: (I) generative scene understanding with quality-aware uncertainty propagation, (II) physics-informed predictive dynamics with multi-agent counterfactual reasoning, and (III) collaborative world models for V2X communication via latent space alignment. We propose a dual-layer architecture, annotation-free perception as a multi-modal data engine feeding end-to-end generative world models, with a phased sensor strategy from LiDAR through 4D radar and signal phase data to event cameras. We establish a taxonomy of driving world model paradigms, position I-WM relative to LeCun's JEPA, Li Fei-Fei's spatial intelligence, and VLA architectures, and introduce Infrastructure VLA (I-VLA) as a novel unification of roadside perception, language commands, and traffic control actions. Our vision builds upon existing multi-LiDAR pipelines and identifies open-source foundations for each phase, providing a path toward infrastructure that understands and anticipates traffic.
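Abstract (reference translation): World models, generative AI systems that simulate how environments evolve, are transforming autonomous driving, yet all existing approaches adopt the ego-vehicle perspective. We argue that infrastructure-centric world models offer a fundamentally complementary capability. Fixed roadside sensors excel at temporal depth, accumulating long-term behavioral distributions that include rare safety-critical events, while vehicle-borne sensors excel at spatial breadth, sampling diverse scenes across large road networks. This paper presents a vision for Infrastructure-centric World Models (I-WM) in three phases: (I) generative scene understanding with quality-aware uncertainty propagation, (II) physics-informed predictive dynamics, and (III) collaborative world models for V2X communication via latent space alignment. We propose a dual-layer architecture in which annotation-free perception acts as a multi-modal data engine feeding end-to-end generative world models, with a phased sensor strategy from LiDAR through 4D radar and signal phase data to event cameras. We establish a taxonomy of driving world model paradigms, position I-WM relative to LeCun's JEPA, Li Fei-Fei's spatial intelligence, and VLA architectures, and introduce Infrastructure VLA (I-VLA) as a new unification of roadside perception, language commands, and traffic control actions. Our vision builds on existing multi-LiDAR pipelines and identifies open-source foundations for each phase, providing a path toward infrastructure that understands and anticipates traffic.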

Related papers