Local 3D Editing via 3D Distillation of CLIP Knowledge
- URL: http://arxiv.org/abs/2306.12570v1
- Date: Wed, 21 Jun 2023 21:09:45 GMT
- Title: Local 3D Editing via 3D Distillation of CLIP Knowledge
- Authors: Junha Hyung, Sungwon Hwang, Daejin Kim, Hyunji Lee, Jaegul Choo
- Abstract summary: 3D content manipulation is an important computer vision task with many real-world applications.
Recently proposed 3D GANs can generate diverse, photorealistic 3D-aware content using Neural Radiance Fields (NeRF).
We propose Local Editing NeRF (LENeRF), which requires only text inputs for fine-grained and localized manipulation.
- Score: 26.429032648560018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D content manipulation is an important computer vision task with many
real-world applications (e.g., product design, cartoon generation, and 3D
avatar editing). Recently proposed 3D GANs can generate diverse, photorealistic
3D-aware content using Neural Radiance Fields (NeRF). However, manipulating
NeRFs remains challenging: visual quality tends to degrade after manipulation,
and existing approaches rely on suboptimal control handles such as 2D semantic
maps. While text-guided manipulation has shown potential in 3D editing, such
approaches often lack locality. To overcome these problems, we propose Local
Editing NeRF (LENeRF), which requires only text inputs for fine-grained and
localized manipulation. Specifically, we present three add-on modules of
LENeRF, the Latent Residual Mapper, the Attention Field Network, and the
Deformation Network, which are jointly used for local manipulation of 3D
features by estimating a 3D attention field. The 3D attention field is learned
in an unsupervised way, by distilling the zero-shot mask generation capability
of CLIP to 3D space with multi-view guidance. We conduct diverse experiments
and thorough evaluations, both quantitative and qualitative.
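The abstract describes the core mechanism concretely: an edited latent produces edited 3D features, and a learned 3D attention field decides, per point, where the edit applies. Below is a minimal PyTorch sketch of that blending idea. The module names follow the paper, but their internals, dimensions, and the blending rule are illustrative assumptions; the Deformation Network and the CLIP-distillation loss that trains the mask are omitted (a sketch of the distillation step appears after the related-papers list).

```python
# A minimal sketch of LENeRF-style local editing, under the assumptions above.
import torch
import torch.nn as nn

class LatentResidualMapper(nn.Module):
    """Maps a source latent w to an edited latent w + delta-w for the text edit."""
    def __init__(self, latent_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        return w + self.mlp(w)  # residual keeps the edit close to the source

class AttentionFieldNetwork(nn.Module):
    """Predicts a soft 3D mask m(x) in [0, 1] from per-point features."""
    def __init__(self, feat_dim: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(feats)  # (N, 1) mask per 3D sample point

def locally_edited_features(f_src, f_edit, mask_net):
    """Blend source and edited 3D features so the edit applies only
    where the attention field is active."""
    m = mask_net(f_src)                    # soft 3D mask
    return m * f_edit + (1.0 - m) * f_src  # convex per-point blend

# Usage with stand-in tensors: 4096 sample points, 32-dim features.
f_src, f_edit = torch.randn(4096, 32), torch.randn(4096, 32)
f_local = locally_edited_features(f_src, f_edit, AttentionFieldNetwork(32))
```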
Related papers
- Revealing Directions for Text-guided 3D Face Editing [52.85632020601518]
3D face editing is a significant task in multimedia, aimed at the manipulation of 3D face models across various control signals.
We present Face Clan, a text-general approach for generating and manipulating 3D faces based on arbitrary attribute descriptions.
Our method offers precisely controllable manipulation, allowing users to intuitively customize regions of interest via the text description.
arXiv Detail & Related papers (2024-10-07T12:04:39Z)
- iControl3D: An Interactive System for Controllable 3D Scene Generation [57.048647153684485]
iControl3D is a novel interactive system that empowers users to generate and render customizable 3D scenes with precise control.
We leverage 3D meshes as an intermediary proxy to iteratively merge individual 2D diffusion-generated images into a cohesive and unified 3D scene representation.
Our neural rendering interface enables users to build a radiance field of their scene online and navigate the entire scene.
arXiv Detail & Related papers (2024-08-03T06:35:09Z)
- $M^2D$NeRF: Multi-Modal Decomposition NeRF with 3D Feature Fields [33.168225243348786]
We present a single model, Multi-Modal Decomposition NeRF ($M^2D$NeRF), that is capable of both text-based and visual patch-based edits.
Specifically, we use multi-modal feature distillation to integrate teacher features from pretrained visual and language models into 3D semantic feature volumes.
Experiments on various real-world scenes show superior performance in 3D scene decomposition tasks compared to prior NeRF-based methods.
arXiv Detail & Related papers (2024-05-08T12:25:21Z)
- NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields [57.617972778377215]
We show how to generate effective 3D representations from posed RGB images.
We pretrain this representation at scale on our proposed curated posed-RGB data, totaling over 1.8 million images.
Our novel self-supervised pretraining for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various challenging 3D tasks.
arXiv Detail & Related papers (2024-04-01T17:59:55Z)
- Weakly Supervised Monocular 3D Detection with a Single-View Image [58.57978772009438]
Monocular 3D detection aims for precise 3D object localization from a single-view image.
We propose SKD-WM3D, a weakly supervised monocular 3D detection framework.
We show that SKD-WM3D clearly surpasses the state of the art and is even on par with many fully supervised methods.
arXiv Detail & Related papers (2024-02-29T13:26:47Z)
- Neural Feature Fusion Fields: 3D Distillation of Self-Supervised 2D Image Representations [92.88108411154255]
We present a method that improves dense 2D image feature extractors when the latter are applied to the analysis of multiple images reconstructible as a 3D scene.
We show that our method not only enables semantic understanding in the context of scene-specific neural fields without the use of manual labels, but also consistently improves over the self-supervised 2D baselines.
arXiv Detail & Related papers (2022-09-07T23:24:09Z)
- DM-NeRF: 3D Scene Geometry Decomposition and Manipulation from 2D Images [15.712721653893636]
DM-NeRF is among the first to simultaneously reconstruct, decompose, manipulate and render complex 3D scenes in a single pipeline.
Our method can accurately decompose all 3D objects from 2D views, allowing any interested object to be freely manipulated in 3D space.
arXiv Detail & Related papers (2022-08-15T14:32:10Z)
- Decomposing NeRF for Editing via Feature Field Distillation [14.628761232614762]
Editing a scene represented by a NeRF is challenging, as the underlying connectionist representations are not object-centric or compositional.
In this work, we tackle the problem of semantic scene decomposition of NeRFs to enable query-based local editing.
We propose to distill the knowledge of off-the-shelf, self-supervised 2D image feature extractors into a 3D feature field optimized in parallel to the radiance field (a minimal sketch of this distillation step appears after this list).
arXiv Detail & Related papers (2022-05-31T07:56:09Z)
- DRaCoN -- Differentiable Rasterization Conditioned Neural Radiance Fields for Articulated Avatars [92.37436369781692]
We present DRaCoN, a framework for learning full-body volumetric avatars.
It exploits the advantages of both the 2D and 3D neural rendering techniques.
Experiments on the challenging ZJU-MoCap and Human3.6M datasets indicate that DRaCoN outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-03-29T17:59:15Z)
- 3D-aware Image Synthesis via Learning Structural and Textural Representations [39.681030539374994]
We propose VolumeGAN for high-fidelity 3D-aware image synthesis, which explicitly learns a structural representation and a textural representation.
Our approach achieves substantially higher image quality and better 3D control than previous methods.
arXiv Detail & Related papers (2021-12-20T18:59:40Z)
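Several entries above (Neural Feature Fusion Fields, $M^2D$NeRF, Decomposing NeRF for Editing via Feature Field Distillation), as well as LENeRF itself, rely on the same basic step: volume-rendering a 3D feature field's outputs along rays and regressing them onto a frozen 2D teacher's features at the corresponding pixels. Below is a minimal sketch of that step; all names, shapes, and tensors are stand-in assumptions, not any single paper's implementation.

```python
# A minimal sketch of 2D-to-3D feature distillation, per the assumptions above.
import torch

def render_features(feats, weights):
    """Volume-render per-sample features into a per-ray feature.
    feats:   (R, S, D) features at S samples along each of R rays
    weights: (R, S) rendering weights from the radiance field
    """
    return (weights.unsqueeze(-1) * feats).sum(dim=1)  # (R, D)

def distillation_loss(rendered, teacher):
    """L2 loss between rendered 3D features and frozen 2D teacher features;
    detaching the teacher sends gradients only to the 3D field."""
    return ((rendered - teacher.detach()) ** 2).mean()

# Per training step (sketch): sample rays from a posed view, query the feature
# field alongside the radiance field, and supervise with the teacher's
# features at the pixels those rays hit.
R, S, D = 1024, 64, 384
feats = torch.randn(R, S, D, requires_grad=True)    # field outputs (stand-in)
weights = torch.softmax(torch.randn(R, S), dim=-1)  # stand-in render weights
teacher = torch.randn(R, D)                         # e.g., CLIP/DINO features
loss = distillation_loss(render_features(feats, weights), teacher)
loss.backward()
```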
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.