Mono4DEditor: Text-Driven 4D Scene Editing from Monocular Video via Point-Level Localization of Language-Embedded Gaussians
- URL: http://arxiv.org/abs/2510.09438v1
- Date: Fri, 10 Oct 2025 14:49:49 GMT
- Title: Mono4DEditor: Text-Driven 4D Scene Editing from Monocular Video via Point-Level Localization of Language-Embedded Gaussians
- Authors: Jin-Chuan Shi, Chengye Su, Jiajun Wang, Ariel Shamir, Miao Wang
- Abstract summary: We introduce Mono4DEditor, a framework for flexible and accurate text-driven 4D scene editing. Our method augments 3D Gaussians with quantized CLIP features to form a language-embedded dynamic representation. We show that Mono4DEditor enables high-quality, text-driven edits across diverse scenes and object types.
- Score: 26.932971930852176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Editing 4D scenes reconstructed from monocular videos based on text prompts is a valuable yet challenging task with broad applications in content creation and virtual environments. The key difficulty lies in achieving semantically precise edits in localized regions of complex, dynamic scenes, while preserving the integrity of unedited content. To address this, we introduce Mono4DEditor, a novel framework for flexible and accurate text-driven 4D scene editing. Our method augments 3D Gaussians with quantized CLIP features to form a language-embedded dynamic representation, enabling efficient semantic querying of arbitrary spatial regions. We further propose a two-stage point-level localization strategy that first selects candidate Gaussians via CLIP similarity and then refines their spatial extent to improve accuracy. Finally, targeted edits are performed on localized regions using a diffusion-based video editing model, with flow and scribble guidance ensuring spatial fidelity and temporal coherence. Extensive experiments demonstrate that Mono4DEditor enables high-quality, text-driven edits across diverse scenes and object types, while preserving the appearance and geometry of unedited areas and surpassing prior approaches in both flexibility and visual fidelity.
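For readers who want a concrete picture of the pipeline the abstract describes, below is a minimal sketch of the two-stage point-level localization over language-embedded Gaussians. It is not the authors' implementation: the codebook-lookup dequantization, the similarity threshold, and the k-nearest-neighbor outlier refinement are illustrative assumptions standing in for the paper's actual selection and refinement steps.

```python
# Minimal sketch (assumed, not the authors' code) of two-stage point-level
# localization over language-embedded Gaussians:
#   stage 1 selects candidates by CLIP similarity to the text prompt,
#   stage 2 refines the spatial extent of the selection.
import numpy as np

def localize_gaussians(code_indices, codebook, gaussian_xyz, text_feat,
                       sim_thresh=0.22, k=16):
    """code_indices : (N,)  per-Gaussian index into the CLIP-feature codebook
    codebook     : (K, D) quantized CLIP feature entries, L2-normalized
    gaussian_xyz : (N, 3) Gaussian centers
    text_feat    : (D,)  L2-normalized CLIP embedding of the edit prompt
    Returns indices of Gaussians localized to the queried region."""
    # Dequantize: look up each Gaussian's language feature in the codebook.
    feats = codebook[code_indices]                     # (N, D)

    # Stage 1: keep Gaussians whose cosine similarity to the prompt
    # exceeds a threshold (threshold value is an illustrative choice).
    sims = feats @ text_feat                           # (N,)
    cand = np.where(sims > sim_thresh)[0]
    if cand.size < 2:
        return cand

    # Stage 2 (assumed heuristic): refine the spatial extent by rejecting
    # candidates whose mean k-nearest-neighbor distance inside the candidate
    # set is a >2-sigma outlier, tightening the region around the object.
    pts = gaussian_xyz[cand]                           # (M, 3)
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)  # (M, M)
    kk = min(k, pts.shape[0] - 1)
    knn = np.sort(d2, axis=1)[:, 1:kk + 1].mean(axis=1)      # skip self-distance
    keep = knn < knn.mean() + 2.0 * knn.std()
    return cand[keep]
```

Quantizing per-Gaussian CLIP features into a shared codebook keeps the language embedding compact, which is what makes dense semantic queries over every Gaussian in a dynamic scene practical.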
Related papers
- InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning [60.799998743918955]
We propose a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text. We also propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment.
arXiv Detail & Related papers (2026-03-02T08:13:16Z) - Mastering Regional 3DGS: Locating, Initializing, and Editing with Diverse 2D Priors [67.22744959435708]
3D semantic parsing often underperforms compared to its 2D counterpart, making targeted manipulations within 3D spaces more difficult and limiting the fidelity of edits. We address this problem by leveraging 2D diffusion editing to accurately identify modification regions in each view, followed by inverse rendering for 3D localization. Experiments demonstrate that our method achieves state-of-the-art performance while delivering up to a $4\times$ speedup.
arXiv Detail & Related papers (2025-07-07T19:15:43Z) - PrEditor3D: Fast and Precise 3D Shape Editing [100.09112677669376]
We propose a training-free approach to 3D editing that enables the editing of a single shape within a few minutes. The edited 3D mesh aligns well with the prompts, and remains identical for regions that are not intended to be altered.
arXiv Detail & Related papers (2024-12-09T15:44:47Z) - GSEditPro: 3D Gaussian Splatting Editing with Attention-based Progressive Localization [11.170354299559998]
We propose GSEditPro, a novel 3D scene editing framework that allows users to perform a variety of creative and precise edits using text prompts alone.
We introduce an attention-based progressive localization module to add semantic labels to each Gaussian during rendering.
This enables precise localization on editing areas by classifying Gaussians based on their relevance to the editing prompts derived from cross-attention layers of the T2I model.
arXiv Detail & Related papers (2024-11-15T08:25:14Z) - EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing [114.14164860467227]
We propose EditRoom, a framework capable of executing a variety of layout edits through natural language commands. Specifically, EditRoom leverages Large Language Models (LLMs) for command planning and generates target scenes. We have developed an automatic pipeline to augment existing 3D scene datasets and introduced EditRoom-DB, a large-scale dataset with 83k editing pairs.
arXiv Detail & Related papers (2024-10-03T17:42:24Z) - TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts [119.84478647745658]
TIP-Editor is a 3D scene editing framework that accepts both text and image prompts and a 3D bounding box to specify the editing region.
Experiments have demonstrated that TIP-Editor conducts accurate editing following the text and image prompts in the specified bounding box region.
arXiv Detail & Related papers (2024-01-26T12:57:05Z) - LatentEditor: Text Driven Local Editing of 3D Scenes [8.966537479017951]
We introduce LatentEditor, a framework for precise and locally controlled editing of neural fields using text prompts.
We successfully embed real-world scenes into the latent space, resulting in a faster and more adaptable NeRF backbone for editing.
Our approach achieves faster editing speeds and superior output quality compared to existing 3D editing models.
arXiv Detail & Related papers (2023-12-14T19:38:06Z) - 4D-Editor: Interactive Object-level Editing in Dynamic Neural Radiance Fields via Semantic Distillation [2.027159474140712]
We propose 4D-Editor, an interactive semantic-driven editing framework, for editing dynamic NeRFs.
We propose an extension to the original dynamic NeRF that incorporates hybrid semantic feature distillation to maintain spatio-temporal consistency after editing.
In addition, we develop Multi-view Reprojection Inpainting to fill holes caused by incomplete scene capture after editing.
arXiv Detail & Related papers (2023-10-25T02:20:03Z) - DreamEditor: Text-Driven 3D Scene Editing with Neural Fields [115.07896366760876]
We propose a novel framework that enables users to edit neural fields using text prompts.
DreamEditor generates highly realistic textures and geometry, significantly surpassing previous works in both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2023-06-23T11:53:43Z)