Fugu-MT Paper Translation (Abstract): VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction

Paper summary: VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction

arxiv url: http://arxiv.org/abs/2605.15186v2
Date: Tue, 19 May 2026 03:07:48 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:08.361448
Title: VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction
Title (reference translation): VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction
Authors: Kaixin Zhu, Yiwen Tang, Yifan Yang, Renrui Zhang, Bohan Zeng, Ziyu Guo, Ruichuan An, Zhou Liu, Qizhi Chen, Delin Qu, Jaehong Yoon, Wentao Zhang
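Abstract summary: VGGT-Edit is a feed-forward framework for text-conditioned native 3D scene editing. It introduces depth-synchronized text injection to align semantic guidance with the backbone's spatial poses. VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference speed.
Reference score (independently computed attention score): 59.303842406260124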
License: http://creativecommons.org/licenses/by/4.0/
Abstract: High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures, enabling the generation of complex environments in a single forward pass. However, despite their strong performance in static scene perception, these models remain limited in responding to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy, where individual views are edited independently and then lifted back into 3D space. This indirect pipeline often leads to blurry textures and inconsistent geometry, as 2D editors lack the spatial awareness required to preserve structure across viewpoints. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection to align semantic guidance with the backbone's spatial poses, ensuring stable instruction grounding. This semantic signal is then processed by a residual transformation head, which directly predicts 3D geometric displacements to deform the scene while preserving background stability. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline with 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference speed. The project page is https://chriszkxxx.github.io/VGGT-Edit/.
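Abstract (reference translation): High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures that can generate complex environments in a single forward pass. Although these models perform well at static scene perception, they remain limited in responding to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy in which individual views are edited independently and then lifted back into 3D space. This indirect pipeline often leads to blurry textures and inconsistent geometry, because 2D editors lack the spatial awareness needed to preserve structure across viewpoints. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection, which aligns semantic guidance with the backbone's spatial poses and ensures stable instruction grounding. This semantic signal is processed by a residual transformation head, which directly predicts 3D geometric displacements to deform the scene while preserving background stability. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline with 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference speed. The project page is https://chriszkxxx.github.io/VGGT-Edit/.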
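
The abstract describes a residual transformation head that predicts 3D displacements to deform the scene while keeping the background stable. The sketch below illustrates only that general idea of a residual update on a set of 3D points. It is not the authors' implementation: the function name `apply_residual_field`, the array shapes, and the use of an explicit per-point edit mask are all assumptions made for illustration, since the abstract does not say how background stability is enforced.

```python
import numpy as np

def apply_residual_field(points, residual, edit_mask):
    """Deform 3D points by a per-point residual displacement.

    Hypothetical illustration, not the paper's code.
    points:    (N, 3) scene points before editing
    residual:  (N, 3) predicted displacement for each point (assumed head output)
    edit_mask: (N,) weights in [0, 1]; 0 leaves a point (e.g. background) unchanged
    """
    points = np.asarray(points, dtype=float)
    residual = np.asarray(residual, dtype=float)
    edit_mask = np.asarray(edit_mask, dtype=float)
    # Residual update: edited = original + mask * displacement.
    return points + edit_mask[:, None] * residual
```

With a mask of zero the point is returned unchanged, so a background region keeps its original geometry regardless of what the residual contains there.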

Related papers