TopNet: Transformer-based Object Placement Network for Image Compositing
- URL: http://arxiv.org/abs/2304.03372v1
- Date: Thu, 6 Apr 2023 20:58:49 GMT
- Authors: Sijie Zhu, Zhe Lin, Scott Cohen, Jason Kuen, Zhifei Zhang, Chen Chen
- Abstract summary: Local clues in background images are important to determine the compatibility of placing objects with certain locations/scales.
We propose to learn the correlation between object features and all local background features with a transformer module.
Our new formulation generates a 3D heatmap indicating the plausibility of all location/scale combinations in one network forward pass.
- Score: 43.14411954867784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the problem of automatically placing an object into a
background image for image compositing. Given a background image and a
segmented object, the goal is to train a model to predict plausible placements
(location and scale) of the object for compositing. The quality of the
composite image highly depends on the predicted location/scale. Existing works
either generate candidate bounding boxes or apply sliding-window search using
global representations from background and object images, which fail to model
local information in background images. However, local clues in background
images are important for determining whether an object is compatible with a
given location/scale. In this paper, we propose to learn the correlation
between object features and all local background features with a transformer
module so that detailed information can be provided on all possible
location/scale configurations. A sparse contrastive loss is further proposed to
train our model with sparse supervision. Our new formulation generates a 3D
heatmap indicating the plausibility of all location/scale combinations in one
network forward pass, which is over 10 times faster than the previous
sliding-window method. It also supports interactive search when users provide a
pre-defined location or scale. The proposed method can be trained with explicit
annotation or in a self-supervised manner using an off-the-shelf inpainting
model, and it outperforms state-of-the-art methods significantly. The user
study shows that the trained model generalizes well to real-world images with
diverse challenging scenes and object categories.
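The core idea of the abstract can be sketched in a few lines. The toy below is an assumption-laden simplification, not the paper's implementation: it replaces the transformer decoder with a plain cosine correlation between one object embedding per candidate scale and every local background feature, producing the (scale, height, width) plausibility heatmap in a single pass, and pairs it with a softmax-over-all-cells loss as one plausible reading of the sparse contrastive objective. All function and variable names are hypothetical.

```python
import numpy as np

def placement_heatmap(bg_feats, obj_feats):
    """Correlate object embeddings with every local background feature.

    bg_feats:  (H, W, D) local background features (hypothetical encoder output)
    obj_feats: (S, D) one object embedding per candidate scale
    Returns an (S, H, W) heatmap scoring every location/scale combination.
    """
    # Cosine similarity: normalize both sides, then contract over the
    # feature dimension D for all (scale, location) pairs at once.
    bg = bg_feats / (np.linalg.norm(bg_feats, axis=-1, keepdims=True) + 1e-8)
    ob = obj_feats / (np.linalg.norm(obj_feats, axis=-1, keepdims=True) + 1e-8)
    return np.einsum("hwd,sd->shw", bg, ob)

def sparse_contrastive_loss(heatmap, pos):
    """Cross-entropy over the flattened 3D heatmap with a single annotated
    positive cell (s, y, x); every other cell acts as a negative."""
    s, y, x = pos
    logits = heatmap.reshape(-1)
    m = logits.max()  # stabilize the log-sum-exp
    lse = m + np.log(np.exp(logits - m).sum())
    flat = (s * heatmap.shape[1] + y) * heatmap.shape[2] + x
    return float(lse - logits[flat])
```

Because the heatmap covers all location/scale cells at once, a user-provided location or scale can be honored simply by slicing the tensor before taking the argmax, which matches the interactive-search behavior described above.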
Related papers
- Breaking the Frame: Image Retrieval by Visual Overlap Prediction [53.17564423756082]
We propose a novel visual place recognition approach, VOP, that efficiently addresses occlusions and complex scenes.
The proposed method enables the identification of visible image sections without requiring expensive feature detection and matching.
arXiv Detail & Related papers (2024-06-23T20:00:20Z)
- FoundPose: Unseen Object Pose Estimation with Foundation Features [11.32559845631345]
FoundPose is a model-based method for 6D pose estimation of unseen objects from a single RGB image.
The method can quickly onboard new objects using their 3D models without requiring any object- or task-specific training.
arXiv Detail & Related papers (2023-11-30T18:52:29Z)
- AnyDoor: Zero-shot Object-level Image Customization [63.44307304097742]
This work presents AnyDoor, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations.
Our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage.
arXiv Detail & Related papers (2023-07-18T17:59:02Z)
- Learning-based Relational Object Matching Across Views [63.63338392484501]
We propose a learning-based approach which combines local keypoints with novel object-level features for matching object detections between RGB images.
We train our object-level matching features based on appearance and inter-frame and cross-frame spatial relations between objects in an associative graph neural network.
arXiv Detail & Related papers (2023-05-03T19:36:51Z)
- MeshLoc: Mesh-Based Visual Localization [54.731309449883284]
We explore a more flexible alternative based on dense 3D meshes that does not require feature matching between database images to build the scene representation.
Surprisingly competitive results can be obtained when extracting features on renderings of these meshes, without any neural rendering stage.
Our results show that dense 3D model-based representations are a promising alternative to existing representations and point to interesting and challenging directions for future research.
arXiv Detail & Related papers (2022-07-21T21:21:10Z)
- GALA: Toward Geometry-and-Lighting-Aware Object Search for Compositing [43.14411954867784]
GALA is a generic foreground object search method with discriminative modeling on geometry and lighting compatibility.
It generalizes well on large-scale open-world datasets, i.e. Pixabay and Open Images.
In addition, our method can effectively handle non-box scenarios, where users only provide background images without any input bounding box.
arXiv Detail & Related papers (2022-03-31T22:36:08Z)
- Complex Scene Image Editing by Scene Graph Comprehension [17.72638225034884]
We propose a two-stage method, SGC-Net, for complex scene image editing guided by scene graphs.
In the first stage, we train a Region of Interest (RoI) prediction network that uses scene graphs to predict the locations of the target objects.
The second stage uses a conditional diffusion model to edit the image based on our RoI predictions.
arXiv Detail & Related papers (2022-03-24T05:12:54Z)
- Unsupervised Layered Image Decomposition into Object Prototypes [39.20333694585477]
We present an unsupervised learning framework for decomposing images into layers of automatically discovered object models.
We first validate our approach by providing results on par with the state of the art on standard multi-object synthetic benchmarks.
We then demonstrate the applicability of our model to real images in tasks that include clustering (SVHN, GTSRB), cosegmentation (Weizmann Horse) and object discovery from unfiltered social network images.
arXiv Detail & Related papers (2021-04-29T18:02:01Z)
- Instance Localization for Self-supervised Detection Pretraining [68.24102560821623]
We propose a new self-supervised pretext task, called instance localization.
We show that integration of bounding boxes into pretraining promotes better task alignment and architecture alignment for transfer learning.
Experimental results demonstrate that our approach yields state-of-the-art transfer learning results for object detection.
arXiv Detail & Related papers (2021-02-16T17:58:57Z)
- Contextual Encoder-Decoder Network for Visual Saliency Prediction [42.047816176307066]
We propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task.
We combine the resulting representations with global scene information for accurately predicting visual saliency.
Compared to state-of-the-art approaches, the network is based on a lightweight image classification backbone.
arXiv Detail & Related papers (2019-02-18T16:15:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.