Fugu-MT Paper Translation (Abstract): Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

Paper summary: Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

arxiv url: http://arxiv.org/abs/2512.08924v2
Date: Wed, 10 Dec 2025 14:53:23 GMT
Status: Translation complete
Last updated in system: 2025-12-11 15:14:53.229141
Title: Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
Authors: Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, Ignacio Rocco, Liliane Momeni, Junyu Xie, Shuyang Sun, Rahul Sukthankar, Joëlle K. Barral, Raia Hadsell, Zoubin Ghahramani, Andrew Zisserman, Junlin Zhang, Mehdi S. M. Sajjadi
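Abstract summary: This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently reconstruct the geometry and motion of dynamic scenes from video. Its decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The authors show that the approach outperforms previous methods across a wide range of 4D reconstruction tasks.
Reference score (independently computed attention measure): 54.67332582569525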
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer to the project webpage for animated results: https://d4rt-paper.github.io/.


Related papers