CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields
- URL: http://arxiv.org/abs/2112.05139v1
- Date: Thu, 9 Dec 2021 18:59:55 GMT
- Title: CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields
- Authors: Can Wang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao
- Abstract summary: We present CLIP-NeRF, a multi-modal 3D object manipulation method for neural radiance fields (NeRF).
We propose a unified framework that allows manipulating NeRF in a user-friendly way, using either a short text prompt or an exemplar image.
We evaluate our approach by extensive experiments on a variety of text prompts and exemplar images.
- Score: 33.43993665841577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present CLIP-NeRF, a multi-modal 3D object manipulation method for neural
radiance fields (NeRF). By leveraging the joint language-image embedding space
of the recent Contrastive Language-Image Pre-Training (CLIP) model, we propose
a unified framework that allows manipulating NeRF in a user-friendly way, using
either a short text prompt or an exemplar image. Specifically, to combine the
novel view synthesis capability of NeRF and the controllable manipulation
ability of latent representations from generative models, we introduce a
disentangled conditional NeRF architecture that allows individual control over
both shape and appearance. This is achieved by applying a learned deformation
field to the positional encoding for shape conditioning and deferring color
conditioning to the volumetric rendering stage. To
bridge this disentangled latent representation to the CLIP embedding, we design
two code mappers that take a CLIP embedding as input and update the latent
codes to reflect the targeted edit. The mappers are trained with a CLIP-based
matching loss to ensure manipulation accuracy. Furthermore, to enable editing
on real images, we propose an inverse optimization method that accurately
projects an input image to the latent codes for manipulation. We
evaluate our approach by extensive experiments on a variety of text prompts and
exemplar images and also provide an intuitive interface for interactive
editing. Our implementation is available at
https://cassiepython.github.io/clipnerf/
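
The disentangled conditional NeRF described in the abstract can be pictured with a minimal PyTorch sketch. This is not the released implementation: layer sizes, code dimensions, and the deformation-network layout are illustrative assumptions. Only the separation of concerns follows the abstract: the shape code perturbs the positional encoding (and therefore the density), while the appearance code is injected only into the color head.

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=10):
    # Standard NeRF positional encoding: [sin(2^k * pi * x), cos(2^k * pi * x)].
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device) * torch.pi
    scaled = x[..., None] * freqs                      # (N, 3, num_freqs)
    enc = torch.cat([scaled.sin(), scaled.cos()], dim=-1)
    return enc.flatten(-2)                             # (N, 3 * 2 * num_freqs)

class DisentangledConditionalNeRF(nn.Module):
    """Sketch of a conditional NeRF with separate shape and appearance codes.

    The shape code drives a small deformation network that displaces the
    positional encoding (changing geometry), while the appearance code enters
    only the color branch (leaving density untouched). Sizes are placeholders.
    """
    def __init__(self, shape_dim=128, app_dim=128, pe_dim=60, hidden=256):
        super().__init__()
        # Deformation field conditioned on the shape code.
        self.deform = nn.Sequential(
            nn.Linear(pe_dim + shape_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, pe_dim), nn.Tanh(),
        )
        # Shared trunk -> density; appearance code is deferred to the color head.
        self.trunk = nn.Sequential(
            nn.Linear(pe_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)
        self.color_head = nn.Sequential(
            nn.Linear(hidden + app_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, xyz, shape_code, app_code):
        pe = positional_encoding(xyz)                             # (N, pe_dim)
        shape = shape_code.expand(pe.shape[0], -1)
        pe = pe + self.deform(torch.cat([pe, shape], dim=-1))     # shape-conditioned deformation
        h = self.trunk(pe)
        sigma = self.sigma_head(h)                                # density depends on shape only
        app = app_code.expand(h.shape[0], -1)
        rgb = self.color_head(torch.cat([h, app], dim=-1))        # color sees the appearance code
        return rgb, sigma
```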
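The two code mappers and the CLIP-based matching loss can be sketched in the same spirit. The snippet below assumes PyTorch and the OpenAI `clip` package; `CodeMapper`, its MLP layout, and the code dimensions are placeholder names and sizes, and the cosine-similarity objective is one common way to realize a CLIP matching loss rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

class CodeMapper(nn.Module):
    """Maps a CLIP embedding to an update of a latent code.

    Two such mappers would be used: one for the shape code, one for the
    appearance code. The MLP here is a placeholder architecture.
    """
    def __init__(self, clip_dim=512, code_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, code_dim),
        )

    def forward(self, clip_embedding, code):
        return code + self.net(clip_embedding)  # edited latent code

def clip_matching_loss(clip_model, rendered, text_tokens):
    """Cosine-similarity loss between a rendered view and the target prompt.

    `rendered` must already be resized and normalized to CLIP's input format.
    """
    img_feat = clip_model.encode_image(rendered)
    txt_feat = clip_model.encode_text(text_tokens)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return 1.0 - (img_feat * txt_feat).sum(dim=-1).mean()

# Usage sketch: nudge a shape code toward the prompt "a red sports car".
clip_model, _ = clip.load("ViT-B/32", device="cpu")
tokens = clip.tokenize(["a red sports car"])
with torch.no_grad():
    target_emb = clip_model.encode_text(tokens).float()
shape_mapper = CodeMapper()
new_shape_code = shape_mapper(target_emb, torch.randn(1, 128))
```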
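Finally, the inverse optimization that projects a real image into latent codes can be approximated by plain gradient descent on the codes. `invert_to_latents` and `render_fn` are hypothetical names, the MSE objective is a simplification of whatever the paper actually optimizes, and in practice the camera pose could be optimized jointly in the same loop.

```python
import torch
import torch.nn.functional as F

def invert_to_latents(render_fn, target_image, shape_dim=128, app_dim=128,
                      steps=300, lr=1e-2):
    """Project a real image into shape/appearance codes by optimization.

    `render_fn(shape_code, app_code)` is assumed to differentiably render the
    conditional NeRF from a known camera pose and return a (1, 3, H, W) image
    in [0, 1], matching `target_image`. All names here are placeholders.
    """
    shape_code = torch.zeros(1, shape_dim, requires_grad=True)
    app_code = torch.zeros(1, app_dim, requires_grad=True)
    optim = torch.optim.Adam([shape_code, app_code], lr=lr)

    for _ in range(steps):
        optim.zero_grad()
        render = render_fn(shape_code, app_code)
        loss = F.mse_loss(render, target_image)   # pixel reconstruction term
        loss.backward()
        optim.step()

    return shape_code.detach(), app_code.detach()
```

Once the codes are recovered, the mappers from the previous sketch can edit them from a text prompt or exemplar image, and the conditional NeRF re-renders the edited object from any viewpoint.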
Related papers
- ProteusNeRF: Fast Lightweight NeRF Editing using 3D-Aware Image Context [26.07841568311428]
We present a very simple but effective neural network architecture that is fast and efficient while maintaining a low memory footprint.
Our representation allows straightforward object selection via semantic feature distillation at the training stage.
We propose a local 3D-aware image context to facilitate view-consistent image editing that can then be distilled into fine-tuned NeRFs.
arXiv Detail & Related papers (2023-10-15T21:54:45Z) - FaceCLIPNeRF: Text-driven 3D Face Manipulation using Deformable Neural Radiance Fields [39.57313951313061]
Existing manipulation methods require extensive human labor.
Our approach is designed to require only a single text prompt to manipulate a face reconstructed with NeRF.
Our approach is the first to address the text-driven manipulation of a face reconstructed with NeRF.
arXiv Detail & Related papers (2023-07-21T08:22:14Z) - iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity, CLIP alignment score and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z) - One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations [75.81725681546071]
Free-Form CLIP aims to establish an automatic latent mapping so that one manipulation model handles free-form text prompts.
For one type of image (e.g., human portrait), one FFCLIP model can be learned to handle free-form text prompts.
Both visual and numerical results show that FFCLIP effectively produces semantically accurate and visually realistic images.
arXiv Detail & Related papers (2022-10-14T15:06:05Z) - Towards Counterfactual Image Manipulation via CLIP [106.94502632502194]
Existing methods can achieve realistic editing of different visual attributes such as age and gender of facial images.
We investigate this problem in a text-driven manner with Contrastive-Language-Image-Pretraining (CLIP).
We design a novel contrastive loss that exploits predefined CLIP-space directions to guide the editing toward desired directions from different perspectives.
arXiv Detail & Related papers (2022-07-06T17:02:25Z) - FlexIT: Towards Flexible Semantic Image Translation [59.09398209706869]
We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing.
First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space.
We iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms.
arXiv Detail & Related papers (2022-03-09T13:34:38Z) - StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery [71.1862388442953]
We develop a text-based interface for StyleGAN image manipulation.
We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt.
Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation.
arXiv Detail & Related papers (2021-03-31T17:51:25Z) - Swapping Autoencoder for Deep Image Manipulation [94.33114146172606]
We propose the Swapping Autoencoder, a deep model designed specifically for image manipulation.
The key idea is to encode an image with two independent components and enforce that any swapped combination maps to a realistic image.
Experiments on multiple datasets show that our model produces better results and is substantially more efficient compared to recent generative models.
arXiv Detail & Related papers (2020-07-01T17:59:57Z)