Fugu-MT Paper Translation (Abstract): OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation

Paper overview: OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation

arxiv url: http://arxiv.org/abs/2606.17536v1
Date: Tue, 16 Jun 2026 05:25:55 GMT
Status: Translation complete
Last updated in system: 2026-06-17 17:15:32.287358
Title: OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation
Authors: Zijie Meng, Yufei Liu, Chengqian Ma, Zhiyu Li, Jiyuan Liu, Wenhua Nie, Bingcai Wei, Shuqin Chen, Weichen Xu, Jiquan Yuan, Miao Zhang
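Abstract summary: DRIVE-CHOREO recasts controllable multi-view video generation as latent choreography. On nuScenes, DRIVE-CHOREO sets new state-of-the-art multi-view consistency and BEV mAP (21.6) with competitive FVD (45.7).
Reference score (independently computed attention score): 23.42968075775045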
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generative world models for autonomous driving face two unresolved tensions: heterogeneous control injection, where free-form language, HD-maps, trajectories, and camera poses reside in incompatible representational spaces, and post-hoc cross-view fusion, where per-camera latents fail to encode global 3-D geometry. We trace both to a single root cause: the absence of a shared symbolic interlingua aligning language, geometry, and pixels at the latent-token level. We present DRIVE-CHOREO, an LLM-choreographed multi-agent world model that recasts controllable multi-view video generation as latent choreography. Three Qwen2.5-VL agents - a Director parsing user intent into a structured WorldScript, a Cartographer grounding it into spatially-anchored layout tokens, and an Auditor feeding cross-view critiques back as auxiliary supervision - jointly author a single position-aware token sequence. This sequence is co-compressed with the multi-view video via a view-time permutation that enforces inter-camera geometry within the convolutional receptive field of a 3-D VAE. On nuScenes, DRIVE-CHOREO sets new state-of-the-art multi-view consistency and BEV mAP (21.6) with competitive FVD (45.7); a detector trained purely on our synthetic data gains +2.4 NDS on the real validation split, validating downstream utility.
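The abstract does not spell out how the view-time permutation is defined. As a rough illustration only, one plausible reading is that frames from all cameras are interleaved so that views of the same timestep sit next to each other along the axis a 3-D VAE convolves over. The sketch below assumes that reading; the function names and the `video[view][time]` layout are hypothetical and are not taken from the paper.

```python
from typing import List, TypeVar

Frame = TypeVar("Frame")


def view_time_interleave(video: List[List[Frame]]) -> List[Frame]:
    """Flatten video[view][time] into one sequence ordered time-major.

    Assumption (not from the paper): all views of timestep t are placed
    consecutively, so a small temporal kernel spans several cameras.
    """
    num_views = len(video)
    num_frames = len(video[0])
    return [video[v][t] for t in range(num_frames) for v in range(num_views)]


def view_time_deinterleave(seq: List[Frame], num_views: int) -> List[List[Frame]]:
    """Invert view_time_interleave, recovering video[view][time]."""
    num_frames = len(seq) // num_views
    return [
        [seq[t * num_views + v] for t in range(num_frames)]
        for v in range(num_views)
    ]


# Toy example: 3 cameras, 2 timesteps, frames represented by labels.
video = [[f"cam{v}_t{t}" for t in range(2)] for v in range(3)]
seq = view_time_interleave(video)
# seq == ['cam0_t0', 'cam1_t0', 'cam2_t0', 'cam0_t1', 'cam1_t1', 'cam2_t1']
restored = view_time_deinterleave(seq, num_views=3)
```

With this ordering, a convolution kernel of width 3 along the sequence axis would cover three neighbouring cameras at one timestep, which is one way inter-camera geometry could fall inside the receptive field. The actual permutation used by the authors may differ.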

Related papers