Fugu-MT Paper Translation (Abstract): Mono4DEditor: Text-Driven 4D Scene Editing from Monocular Video via Point-Level Localization of Language-Embedded Gaussians

Paper overview: Mono4DEditor: Text-Driven 4D Scene Editing from Monocular Video via Point-Level Localization of Language-Embedded Gaussians

arxiv url: http://arxiv.org/abs/2510.09438v1
Date: Fri, 10 Oct 2025 14:49:49 GMT
Status: Translation complete
Last updated in system: 2025-10-14 00:38:49.294927
Title: Mono4DEditor: Text-Driven 4D Scene Editing from Monocular Video via Point-Level Localization of Language-Embedded Gaussians
Title (reference translation): Mono4DEditor: Text-driven 4D scene editing from monocular video via point-level localization of language-embedded Gaussians
Authors: Jin-Chuan Shi, Chengye Su, Jiajun Wang, Ariel Shamir, Miao Wang
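Abstract summary: We introduce Mono4DEditor, a framework for flexible and accurate text-driven 4D scene editing. The method augments 3D Gaussians with quantized CLIP features to form a language-embedded dynamic representation. Mono4DEditor enables high-quality, text-driven edits across diverse scenes and object types.
Reference score (independently computed attention score): 26.932971930852176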
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Editing 4D scenes reconstructed from monocular videos based on text prompts is a valuable yet challenging task with broad applications in content creation and virtual environments. The key difficulty lies in achieving semantically precise edits in localized regions of complex, dynamic scenes, while preserving the integrity of unedited content. To address this, we introduce Mono4DEditor, a novel framework for flexible and accurate text-driven 4D scene editing. Our method augments 3D Gaussians with quantized CLIP features to form a language-embedded dynamic representation, enabling efficient semantic querying of arbitrary spatial regions. We further propose a two-stage point-level localization strategy that first selects candidate Gaussians via CLIP similarity and then refines their spatial extent to improve accuracy. Finally, targeted edits are performed on localized regions using a diffusion-based video editing model, with flow and scribble guidance ensuring spatial fidelity and temporal coherence. Extensive experiments demonstrate that Mono4DEditor enables high-quality, text-driven edits across diverse scenes and object types, while preserving the appearance and geometry of unedited areas and surpassing prior approaches in both flexibility and visual fidelity.
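Abstract (reference translation): Editing 4D scenes reconstructed from monocular video according to text prompts is a valuable yet challenging task with broad applications in content creation and virtual environments. The main difficulty is making semantically precise edits in localized regions of complex, dynamic scenes while keeping unedited content intact. To address this, we introduce Mono4DEditor, a new framework for flexible and accurate text-driven 4D scene editing. The method augments 3D Gaussians with quantized CLIP features to build a language-embedded dynamic representation, which allows efficient semantic querying of arbitrary spatial regions. We also propose a two-stage point-level localization strategy that first selects candidate Gaussians by CLIP similarity and then refines their spatial extent to improve accuracy. Finally, targeted edits are applied to the localized regions with a diffusion-based video editing model, using flow and scribble guidance to ensure spatial fidelity and temporal coherence. Extensive experiments show that Mono4DEditor enables high-quality, text-driven edits across diverse scenes and object types, preserves the appearance and geometry of unedited areas, and surpasses prior approaches in both flexibility and visual fidelity.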
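
The two-stage point-level localization described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the spatial extent is refined, so the neighbor-count filter in stage 2 is an assumed stand-in, and the function name `localize_gaussians`, the codebook layout, and all thresholds are hypothetical.

```python
import numpy as np


def localize_gaussians(positions, code_ids, codebook, text_embedding,
                       sim_threshold=0.5, radius=0.2, min_neighbors=2):
    """Return a boolean mask over N Gaussians (illustrative only).

    positions:      (N, 3) Gaussian centers
    code_ids:       (N,) index of each Gaussian's quantized feature
    codebook:       (K, D) quantized CLIP-like feature vectors (assumed layout)
    text_embedding: (D,) embedding of the text query
    """
    # Stage 1: look up each Gaussian's quantized feature and keep those
    # whose cosine similarity to the text query exceeds the threshold.
    feats = codebook[code_ids]
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    query = text_embedding / np.linalg.norm(text_embedding)
    similarity = feats @ query
    candidates = similarity > sim_threshold

    # Stage 2 (assumed heuristic, not from the paper): drop isolated
    # candidates that have too few other candidates within `radius`.
    cand_idx = np.flatnonzero(candidates)
    if cand_idx.size == 0:
        return candidates
    cand_pos = positions[cand_idx]
    dists = np.linalg.norm(cand_pos[:, None, :] - cand_pos[None, :, :], axis=-1)
    neighbor_counts = (dists < radius).sum(axis=1) - 1  # exclude self
    refined = np.zeros_like(candidates)
    refined[cand_idx[neighbor_counts >= min_neighbors]] = True
    return refined
```

Storing a codebook index per Gaussian instead of a full feature vector is one plausible reading of "quantized CLIP features"; the pairwise distance matrix here is quadratic in the number of candidates, so a real system would likely use a spatial index instead.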

Related papers