Customizing Text-to-Image Diffusion with Object Viewpoint Control
- URL: http://arxiv.org/abs/2404.12333v2
- Date: Mon, 02 Dec 2024 21:15:29 GMT
- Title: Customizing Text-to-Image Diffusion with Object Viewpoint Control
- Authors: Nupur Kumari, Grace Su, Richard Zhang, Taesung Park, Eli Shechtman, Jun-Yan Zhu
- Abstract summary: We introduce a new task -- enabling explicit control of the object viewpoint in the customization of text-to-image diffusion models.
This allows us to modify the custom object's properties and generate it in various background scenes via text prompts.
We propose to condition the diffusion process on the 3D object features rendered from the target viewpoint.
- Score: 53.621518249820745
- Abstract: Model customization introduces new concepts to existing text-to-image models, enabling the generation of these new concepts/objects in novel contexts. However, such methods lack accurate camera view control with respect to the new object, and users must resort to prompt engineering (e.g., adding "top-view") to achieve coarse view control. In this work, we introduce a new task -- enabling explicit control of the object viewpoint in the customization of text-to-image diffusion models. This allows us to modify the custom object's properties and generate it in various background scenes via text prompts, all while incorporating the object viewpoint as an additional control. This new task presents significant challenges, as one must harmoniously merge a 3D representation from the multi-view images with the 2D pre-trained model. To bridge this gap, we propose to condition the diffusion process on the 3D object features rendered from the target viewpoint. During training, we fine-tune the 3D feature prediction modules to reconstruct the object's appearance and geometry, while reducing overfitting to the input multi-view images. Our method outperforms existing image editing and model customization baselines in preserving the custom object's identity while following the target object viewpoint and the text prompt.
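To make the conditioning scheme concrete, here is a minimal sketch (not the authors' code) of the idea the abstract describes: a 3D feature map of the object is rendered from the user's target camera, projected into tokens, and injected into the denoiser alongside the text embedding through cross-attention. All module names and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ViewConditionedBlock(nn.Module):
    """Illustrative sketch: a denoiser block that cross-attends to both
    text tokens and 3D object features rendered from the target view."""

    def __init__(self, dim=320, text_dim=768, feat_dim=256, heads=8):
        super().__init__()
        self.feat_proj = nn.Conv2d(feat_dim, dim, kernel_size=1)  # rendered 3D features -> tokens
        self.text_proj = nn.Linear(text_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x_tokens, text_emb, rendered_feats):
        # x_tokens:       (B, N, dim) noisy-latent tokens inside the UNet
        # text_emb:       (B, T, text_dim) frozen text-encoder output
        # rendered_feats: (B, feat_dim, H, W) object features rasterized
        #                 from the user-specified target viewpoint
        view_tokens = self.feat_proj(rendered_feats).flatten(2).transpose(1, 2)
        context = torch.cat([self.text_proj(text_emb), view_tokens], dim=1)
        attended, _ = self.cross_attn(x_tokens, context, context)
        return x_tokens + attended  # residual injection of both conditions
```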
Related papers
- Instructive3D: Editing Large Reconstruction Models with Text Instructions [2.9575146209034853]
Instructive3D is a novel LRM-based model that integrates the generation and fine-grained, text-prompted editing of 3D objects into a single model.
We find that Instructive3D produces superior 3D objects with the properties specified by the edit prompts.
arXiv Detail & Related papers (2025-01-08T09:28:25Z)
- Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models [78.90023746996302]
Add-it is a training-free approach that extends a diffusion model's attention mechanism to incorporate information from three key sources: the scene image, the text prompt, and the generated image itself.
Its weighted extended-attention mechanism maintains structural consistency and fine details while ensuring natural object placement (a minimal sketch follows this entry).
Human evaluations show that Add-it is preferred in over 80% of cases.
arXiv Detail & Related papers (2024-11-11T18:50:09Z)
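A hedged sketch of the weighted extended-attention idea summarized above: queries from the image being generated attend over keys and values gathered from several token sources at once, with a scalar weight per source rebalancing its contribution. Shapes, the per-source weights, and the shared key/value projection are assumptions, not Add-it's actual design.

```python
import torch

def weighted_extended_attention(q, sources, weights, num_heads=8):
    """Attend from query tokens q over several key/value sources at once.

    q:       (B, Nq, D) query tokens of the image being generated
    sources: list of (B, Ni, D) token sets (e.g., scene image, prompt,
             and the generated image itself)
    weights: one scalar per source, rescaling its attention logits
    """
    b, nq, d = q.shape
    hd = d // num_heads
    qh = q.view(b, nq, num_heads, hd).transpose(1, 2)          # (B, H, Nq, hd)

    logits, values = [], []
    for src, w in zip(sources, weights):
        kh = src.view(b, -1, num_heads, hd).transpose(1, 2)    # (B, H, Ni, hd)
        logits.append(w * qh @ kh.transpose(-2, -1) / hd**0.5)
        values.append(kh)  # keys double as values in this simplified sketch
    attn = torch.softmax(torch.cat(logits, dim=-1), dim=-1)    # joint softmax
    out = attn @ torch.cat(values, dim=-2)                     # (B, H, Nq, hd)
    return out.transpose(1, 2).reshape(b, nq, d)
```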
- Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models [32.51506331929564]
We propose to use a set of per-object representations, Neural Assets, to control the 3D pose of individual objects in a scene (a rough sketch follows this entry).
Our model achieves state-of-the-art multi-object editing results on both synthetic 3D scene datasets and two real-world video datasets.
arXiv Detail & Related papers (2024-06-13T16:29:18Z)
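A rough sketch of the per-object representation idea: each object gets an appearance code paired with an explicit 3D pose, and the pairs are fused into conditioning tokens so that editing one object's pose leaves the others untouched. The dimensions and the flattened-matrix pose encoding are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NeuralAssetTokens(nn.Module):
    """Illustrative: fuse per-object appearance codes with 3D poses into
    a token sequence used to condition an image diffusion model."""

    def __init__(self, app_dim=512, pose_dim=12, token_dim=768):
        super().__init__()
        # pose assumed flattened from a 3x4 object-to-camera matrix
        self.fuse = nn.Sequential(
            nn.Linear(app_dim + pose_dim, token_dim),
            nn.GELU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, appearance, poses):
        # appearance: (B, K, app_dim) one code per object in the scene
        # poses:      (B, K, pose_dim) target pose per object; changing a
        #             single row re-poses that object only
        return self.fuse(torch.cat([appearance, poses], dim=-1))  # (B, K, token_dim)
```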
- CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models [85.69959024572363]
CustomNet is a novel object customization approach that explicitly incorporates 3D novel view synthesis capabilities into the object customization process.
We introduce dedicated designs to enable location control and flexible background control through textual descriptions or specific user-defined images.
Our method facilitates zero-shot object customization without test-time optimization, offering simultaneous control over the viewpoints, location, and background.
arXiv Detail & Related papers (2023-10-30T17:50:14Z)
- MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare [84.80956484848505]
MegaPose is a method to estimate the 6D pose of novel objects, that is, objects unseen during training.
First, we present a 6D pose refiner based on a render-and-compare strategy that can be applied to novel objects (a simplified version of this loop is sketched after this entry).
Second, we introduce a novel approach for coarse pose estimation that leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner.
arXiv Detail & Related papers (2022-12-13T19:30:03Z)
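A simplified sketch of the render-and-compare refinement mentioned above: render the object at the current pose estimate, let a network predict a corrective transform from the (rendered, observed) pair, apply it, and repeat. The renderer and refiner are abstract callables here, not MegaPose's actual modules.

```python
import torch

def render_and_compare_refine(pose, observed, render_fn, refiner, iters=5):
    """Iterative pose refinement (illustrative).

    pose:      (B, 4, 4) current object-to-camera pose estimate
    observed:  (B, 3, H, W) cropped real image of the target object
    render_fn: callable rendering the object mesh at a pose -> (B, 3, H, W)
    refiner:   network mapping the stacked pair -> (B, 4, 4) pose update
    """
    for _ in range(iters):
        rendered = render_fn(pose)
        delta = refiner(torch.cat([rendered, observed], dim=1))
        pose = delta @ pose  # left-apply the predicted correction
    return pose
```

The coarse stage described in the summary can then be read as ranking candidate poses by whether this refiner is likely to be able to correct them.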
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
- AutoRF: Learning 3D Object Radiance Fields from Single View Observations [17.289819674602295]
AutoRF is a new approach for learning neural 3D object representations in which each object in the training set is observed from only a single view (a minimal sketch of this setup follows the entry).
We show that our method generalizes well to unseen objects, even across different datasets of challenging real-world street scenes.
arXiv Detail & Related papers (2022-04-07T17:13:39Z)
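A minimal sketch of the single-view setup that summary points at: an encoder maps one observation to shape and appearance codes, and a code-conditioned radiance field decodes density and color from them, with only the observed view supervised during training. Layer sizes below are assumptions.

```python
import torch
import torch.nn as nn

class CodeConditionedRadianceField(nn.Module):
    """Illustrative: a radiance field conditioned on per-object codes."""

    def __init__(self, shape_dim=128, app_dim=128, hidden=256):
        super().__init__()
        self.density = nn.Sequential(
            nn.Linear(3 + shape_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
        self.color = nn.Sequential(
            nn.Linear(3 + shape_dim + app_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))

    def forward(self, xyz, shape_code, app_code):
        # xyz: (N, 3) sample points in the object's normalized frame
        # shape_code / app_code: (1, dim) codes from an image encoder
        s = shape_code.expand(xyz.shape[0], -1)
        a = app_code.expand(xyz.shape[0], -1)
        sigma = self.density(torch.cat([xyz, s], dim=-1))        # opacity
        rgb = torch.sigmoid(self.color(torch.cat([xyz, s, a], dim=-1)))
        return sigma, rgb
```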
- Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views [9.556376932449187]
Multi-View and Multi-Object Network (MulMON) is a method for learning accurate, object-centric representations of multi-object scenes by leveraging multiple views.
We show that MulMON resolves spatial ambiguities better than single-view methods.
arXiv Detail & Related papers (2021-11-13T13:54:28Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to improved layout fidelity in our model.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited for multi-object images (a rough sketch follows the list).
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
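To ground the SceneFID idea mentioned in the last entry, here is a hedged sketch: rather than scoring whole images, features are computed on per-object crops taken from the layout boxes, and the standard Fréchet distance is applied to those crop features. The crop and feature functions are left abstract; only the FID formula itself is standard.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """Standard Frechet distance between two (N, D) feature sets."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):       # strip tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

def scene_fid(real_images, fake_images, layouts, crop_fn, feat_fn):
    """Object-centric variant (illustrative): run FID on object crops
    cut out of both real and generated images using the layout boxes."""
    real_crops = [crop_fn(img, box)
                  for img, boxes in zip(real_images, layouts) for box in boxes]
    fake_crops = [crop_fn(img, box)
                  for img, boxes in zip(fake_images, layouts) for box in boxes]
    return frechet_distance(feat_fn(real_crops), feat_fn(fake_crops))
```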
This list is automatically generated from the titles and abstracts of the papers on this site.