Fugu-MT Paper Translation (Abstract): Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

Paper overview: Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

arxiv url: http://arxiv.org/abs/2604.12309v1
Date: Tue, 14 Apr 2026 05:35:46 GMT
Status: translation complete
System update date: 2026-04-15 19:11:32.261973
Title: Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
Title (reference translation): Towards Realistic and Consistent Orbital Video Generation via 3D Foundation
Authors: Rong Wang, Ruyi Zha, Ziang Cheng, Jiayu Yang, Pulak Purkait, Hongdong Li
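Abstract summary: This paper proposes a new method for generating orbital videos from a single image of an object. Compared with state-of-the-art methods, the method achieves superior visual quality, shape realism, and multi-view consistency.
Reference score (independently computed attention level): 61.34273238077091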
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such a mechanism does not impose sufficient constraints for long-range extrapolation, e.g., rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised global latent vector as an overall structural guidance, and (ii) a set of latent images projected from volumetric features to provide view-dependent and fine-grained geometry details. In contrast to commonly used 2.5D representations such as depth or normal maps, these compact features can model complete object shapes, and help to improve inference efficiency by avoiding explicit mesh extraction. To achieve effective shape conditioning, we introduce a multi-scale 3D adapter to inject feature tokens into the base video model via cross-attention, which retains its capabilities from general video pretraining and enables a simple and model-agnostic fine-tuning process. Extensive experiments on multiple benchmarks show that our method achieves superior visual quality, shape realism and multi-view consistency compared to state-of-the-art methods, and robustly generalizes to complex camera trajectories and in-the-wild images.
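Abstract (reference translation): We propose a new method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works rely mainly on pixel-wise attention to enforce view consistency across frames. However, such a mechanism does not impose sufficient constraints for long-range extrapolation, such as rear-view synthesis, where pixel correspondences to the input image are limited. As a result, these works often fail to produce results with a plausible and coherent structure. We therefore propose to use rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its ability to model realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised global latent vector as overall structural guidance, and (ii) a set of latent images projected from volumetric features, which provide view-dependent, fine-grained geometric details. In contrast to commonly used 2.5D representations such as depth or normal maps, these compact features can model complete object shapes and help improve inference efficiency by avoiding explicit mesh extraction. To achieve effective shape conditioning, we introduce a multi-scale 3D adapter that injects feature tokens into the base video model via cross-attention. Experiments on multiple benchmarks show that our method achieves superior visual quality, shape realism, and multi-view consistency compared with state-of-the-art methods, and generalizes robustly to complex camera trajectories and in-the-wild images.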

Related papers