Fugu-MT Paper Translation (Abstract): BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model

Paper overview: BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model

arxiv url: http://arxiv.org/abs/2602.22596v1
Date: Thu, 26 Feb 2026 03:58:42 GMT
Status: Translation complete
Last updated in system: 2026-02-27 18:41:22.519646
Title: BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model
Authors: Yuci Han, Charles Toth, John E. Anderson, William J. Shuart, Alper Yilmaz
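Abstract summary: We propose BetterScene, an approach that improves novel view synthesis (NVS) quality for diverse real-world scenes using extremely sparse, unconstrained photos. BetterScene leverages the production-ready Stable Video Diffusion (SVD) model, pretrained on billions of frames, as a strong backbone. We evaluate on the challenging DL3DV-10K dataset and demonstrate superior performance compared to state-of-the-art methods.
Reference score (independently computed attention score): 3.7515646463759698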
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present BetterScene, an approach to enhance novel view synthesis (NVS) quality for diverse real-world scenes using extremely sparse, unconstrained photos. BetterScene leverages the production-ready Stable Video Diffusion (SVD) model pretrained on billions of frames as a strong backbone, aiming to mitigate artifacts and recover view-consistent details at inference time. Conventional methods have developed similar diffusion-based solutions to address these challenges of novel view synthesis. Despite significant improvements, these methods typically rely on off-the-shelf pretrained diffusion priors and fine-tune only the UNet module while keeping other components frozen, which still leads to inconsistent details and artifacts even when incorporating geometry-aware regularizations like depth or semantic conditions. To address this, we investigate the latent space of the diffusion model and introduce two components: (1) temporal equivariance regularization and (2) vision foundation model-aligned representation, both applied to the variational autoencoder (VAE) module within the SVD pipeline. BetterScene integrates a feed-forward 3D Gaussian Splatting (3DGS) model to render features as inputs for the SVD enhancer and generate continuous, artifact-free, consistent novel views. We evaluate on the challenging DL3DV-10K dataset and demonstrate superior performance compared to state-of-the-art methods.

Related papers