Fugu-MT Paper Translation (Abstract): Grounding World Simulation Models in a Real-World Metropolis

Paper overview: Grounding World Simulation Models in a Real-World Metropolis

arxiv url: http://arxiv.org/abs/2603.15583v1
Date: Mon, 16 Mar 2026 17:46:04 GMT
Status: Translation complete
Last updated in system: 2026-03-17 18:28:58.708368
Title: Grounding World Simulation Models in a Real-World Metropolis
Title (reference translation): Grounding World Simulation Models in a Real-World Metropolis
Authors: Junyoung Seo, Hyunwook Choi, Minkyung Kwon, Jinhyeok Choi, Siyoon Jin, Gayoung Lee, Junho Kim, JoungBin Lee, Geonmo Gu, Dongyoon Han, Sangdoo Yun, Seungryong Kim, Jin-Hwa Kim
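Abstract summary: We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor.
Reference score (independently computed attention score): 80.10324496369951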
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges, including temporal misalignment between retrieved references and the dynamic target scene, limited trajectory diversity and data sparsity from vehicle-mounted captures at sparse intervals. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.
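Abstract (reference translation): What if a world simulation model could render a city that actually exists rather than an imagined environment? Previous generative world models synthesize visually plausible but artificial environments by imagining all of the content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. This design, however, introduces several challenges: temporal misalignment between the retrieved references and the dynamic target scene, limited trajectory diversity, and data sparsity caused by vehicle-mounted captures taken at sparse intervals. We address these challenges with cross-temporal pairing, a large-scale synthetic dataset that enables diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink, which stabilizes long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods at generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.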
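
The abstract describes two retrieval steps: conditioning each generated chunk on nearby street-view images, and re-grounding each chunk to a retrieved image at a future location (the Virtual Lookahead Sink). The sketch below illustrates only the retrieval bookkeeping, under loudly stated assumptions: the abstract does not say how SWM retrieves images, so the haversine nearest-neighbor lookup, the function names (`haversine_m`, `retrieve_nearby`, `plan_chunks`), and the parameters (`chunk_len`, `lookahead`, `k`) are all hypothetical and are not the authors' implementation.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; adequate for a rough sketch


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def retrieve_nearby(query, database, k=2):
    """Return the ids of the k images whose capture location is closest to query.

    database is a list of (image_id, (lat, lon)) pairs. A linear scan is an
    assumption made for brevity; a real system would likely use a spatial index.
    """
    ranked = sorted(database, key=lambda item: haversine_m(query, item[1]))
    return [image_id for image_id, _ in ranked[:k]]


def plan_chunks(trajectory, database, chunk_len=2, lookahead=1, k=2):
    """For each chunk of camera positions, pick reference and lookahead images.

    Hypothetical scheme: references are the k images nearest the chunk's first
    position; the lookahead image is the one nearest a position `lookahead`
    steps past the chunk's end, clamped to the end of the trajectory.
    """
    plans = []
    for start in range(0, len(trajectory), chunk_len):
        chunk = trajectory[start:start + chunk_len]
        future_idx = min(start + len(chunk) - 1 + lookahead, len(trajectory) - 1)
        plans.append({
            "start": start,
            "references": retrieve_nearby(chunk[0], database, k),
            "lookahead": retrieve_nearby(trajectory[future_idx], database, 1)[0],
        })
    return plans
```

Calling `plan_chunks(trajectory, database)` with a list of (lat, lon) camera positions and a list of geotagged image ids returns one plan per chunk. How a video model would consume those references is outside the scope of this sketch.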
