Fugu-MT Paper Translation (Abstract): 4DVLT: Dynamic Scene Understanding with Worldline-Centered Vision-Language Tracking

Paper overview: 4DVLT: Dynamic Scene Understanding with Worldline-Centered Vision-Language Tracking

arxiv url: http://arxiv.org/abs/2606.22631v1
Date: Sun, 21 Jun 2026 18:33:15 GMT
Status: Translation complete
Last updated in system: 2026-06-25 17:15:09.680802
Title: 4DVLT: Dynamic Scene Understanding with Worldline-Centered Vision-Language Tracking
Title (reference translation): 4DVLT: Dynamic Scene Understanding via Worldline-Centered Vision-Language Tracking
Authors: Chaoyue Li, Boxue Yang, Shengyao Zhou, Haoyang Wu, Rui Qian, Linfeng Zhang
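Abstract summary: We introduce 4DVLT, a worldline-centered task for instruction-conditioned 4D dynamic scene understanding. We propose 4DTrack, which casts instruction-conditioned tracking as graph-conditioned worldline inference. The results show that worldline-centered modeling improves both target grounding and the quality of the recovered worldline.
Reference score (independently computed attention score): 12.29407900173211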
License: http://creativecommons.org/licenses/by/4.0/
Abstract: 4D dynamic scene understanding requires grounding language to a persistent worldline that binds identity, metric 3D motion, and synchronized multi-view 2D projections. Existing paradigms capture only part of this structure: large multimodal models reason over rich visual evidence but rarely preserve metric topology, while vision-language tracking remains tied to fragmented 2D or 3D outputs and local continuation. We therefore introduce 4DVLT, a worldline-centered task for instruction-conditioned 4D dynamic scene understanding in fully observed multi-view video, and Instruct-4D, a benchmark with 129.4K question-answer pairs, 64.7K target entities, 851 scenes, and 9 reasoning-oriented query types. To address this setting, we present 4DTrack, which casts instruction-conditioned tracking as graph-conditioned worldline inference through an object-centric 4D state graph, metric-guided routing, bidirectional decoding, and kinematic calibration. On Instruct-4D, 4DTrack-Qwen3.5-9B reaches 62.68 $\mathrm{TGA}_{\mathrm{Top1}}$ and surpasses the best adapted VLT baseline by 19.62 points. These results show that worldline-centered modeling improves both target grounding and recovered worldline quality. The project page is available at https://github.com/mikubaka88/4DVLT.
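Abstract (reference translation): 4D dynamic scene understanding requires grounding language to a persistent worldline that binds identity, metric 3D motion, and synchronized multi-view 2D projections. Large multimodal models reason over rich visual evidence but rarely preserve metric topology, while vision-language tracking remains tied to fragmented 2D or 3D outputs and local continuation. We therefore introduce 4DVLT, a worldline-centered task for instruction-conditioned 4D dynamic scene understanding in multi-view video, and Instruct-4D, a benchmark with 129.4K question-answer pairs, 64.7K target entities, 851 scenes, and 9 reasoning-oriented query types. To address this setting, we propose 4DTrack, which casts instruction-conditioned tracking as graph-conditioned worldline inference through an object-centric 4D state graph, metric-guided routing, bidirectional decoding, and kinematic calibration. On Instruct-4D, 4DTrack-Qwen3.5-9B reaches 62.68 $\mathrm{TGA}_{\mathrm{Top1}}$, surpassing the best VLT baseline by 19.62 points. These results suggest that worldline-centered modeling improves both target grounding and the quality of the recovered worldline. The project page is available at https://github.com/mikubaka88/4DVLT.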

Related papers