Fugu-MT Paper Translation (Abstract): Simulating the Visual World with Artificial Intelligence: A Roadmap

Paper abstract: Simulating the Visual World with Artificial Intelligence: A Roadmap

arxiv url: http://arxiv.org/abs/2511.08585v1
Date: Wed, 12 Nov 2025 02:05:57 GMT
Status: Translation complete
Last updated in system: 2025-11-12 20:17:03.874726
Title: Simulating the Visual World with Artificial Intelligence: A Roadmap
Title (reference translation): Simulating the Visual World with Artificial Intelligence: A Roadmap
Authors: Jingtong Yue, Ziqi Huang, Zhaoxi Chen, Xintao Wang, Pengfei Wan, Ziwei Liu
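Abstract summary: Video generation is shifting from producing visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. This survey provides a systematic overview of that evolution, conceptualizing modern video foundation models as the combination of two core components: an implicit world model and a video renderer. It traces the progression of video generation through four generations, culminating in a world model, built upon a video generation model, that embodies intrinsic physical plausibility.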
Reference score (independently computed attention score): 48.64639618440864
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The landscape of video generation is shifting, from a focus on generating visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. These developments point toward the emergence of video foundation models that function not only as visual generators but also as implicit world models, models that simulate the physical dynamics, agent-environment interactions, and task planning that govern real or imagined worlds. This survey provides a systematic overview of this evolution, conceptualizing modern video foundation models as the combination of two core components: an implicit world model and a video renderer. The world model encodes structured knowledge about the world, including physical laws, interaction dynamics, and agent behavior. It serves as a latent simulation engine that enables coherent visual reasoning, long-term temporal consistency, and goal-driven planning. The video renderer transforms this latent simulation into realistic visual observations, effectively producing videos as a "window" into the simulated world. We trace the progression of video generation through four generations, in which the core capabilities advance step by step, ultimately culminating in a world model, built upon a video generation model, that embodies intrinsic physical plausibility, real-time multimodal interaction, and planning capabilities spanning multiple spatiotemporal scales. For each generation, we define its core characteristics, highlight representative works, and examine their application domains such as robotics, autonomous driving, and interactive gaming. Finally, we discuss open challenges and design principles for next-generation world models, including the role of agent intelligence in shaping and evaluating these systems. An up-to-date list of related works is maintained at this link.
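Abstract (reference translation): The landscape of video generation is changing, from generating visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. These developments point to the emergence of video foundation models that function not only as visual generators but also as implicit world models, that is, models that simulate the physical dynamics, agent-environment interactions, and task planning that govern real or imagined worlds. This survey provides a systematic overview of this evolution, conceptualizing modern video foundation models as the combination of two core components: an implicit world model and a video renderer. The world model encodes structured knowledge about the world, including physical laws, interaction dynamics, and agent behavior. It serves as a latent simulation engine that enables coherent visual reasoning, long-term temporal consistency, and goal-driven planning. The video renderer transforms this latent simulation into realistic visual observations, effectively producing videos as a "window" into the simulated world. The survey traces the progression of video generation through four generations, in which the core capabilities advance step by step, culminating in a world model that is built upon a video generation model and exhibits intrinsic physical plausibility, real-time multimodal interaction, and planning capabilities spanning multiple spatiotemporal scales. For each generation, it defines the core characteristics, highlights representative works, and examines application domains such as robotics, autonomous driving, and interactive gaming. Finally, it discusses open challenges and design principles for next-generation world models, including the role of agent intelligence in shaping and evaluating these systems. An up-to-date list of related works is maintained at this link.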

Related papers