TopNet: Transformer-based Object Placement Network for Image Compositing
- URL: http://arxiv.org/abs/2304.03372v1
- Date: Thu, 6 Apr 2023 20:58:49 GMT
- Authors: Sijie Zhu, Zhe Lin, Scott Cohen, Jason Kuen, Zhifei Zhang, Chen Chen
- Abstract summary: Local clues in background images are important to determine the compatibility of placing objects with certain locations/scales.
We propose to learn the correlation between object features and all local background features with a transformer module.
Our new formulation generates a 3D heatmap indicating the plausibility of all location/scale combinations in one network forward pass.
- Score: 43.14411954867784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the problem of automatically placing an object into a
background image for image compositing. Given a background image and a
segmented object, the goal is to train a model to predict plausible placements
(location and scale) of the object for compositing. The quality of the
composite image highly depends on the predicted location/scale. Existing works
either generate candidate bounding boxes or apply sliding-window search using
global representations from background and object images, which fail to model
local information in background images. However, local clues in background
images are important for determining whether an object is compatible with a
given location/scale. In this paper, we propose to learn the correlation
between object features and all local background features with a transformer
module so that detailed information can be provided on all possible
location/scale configurations. A sparse contrastive loss is further proposed to
train our model with sparse supervision. Our new formulation generates a 3D
heatmap indicating the plausibility of all location/scale combinations in one
network forward pass, which is over 10 times faster than the previous
sliding-window method. It also supports interactive search when users provide a
pre-defined location or scale. The proposed method can be trained with explicit
annotation or in a self-supervised manner using an off-the-shelf inpainting
model, and it outperforms state-of-the-art methods significantly. The user
study shows that the trained model generalizes well to real-world images with
diverse challenging scenes and object categories.
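The core idea of the abstract can be sketched in a few lines. The toy below is an assumption-laden simplification, not the paper's implementation: it replaces the transformer decoder with a plain cosine correlation between one object embedding per candidate scale and every local background feature, producing the (scale, height, width) plausibility heatmap in a single pass, and pairs it with a softmax-over-all-cells loss as one plausible reading of the sparse contrastive objective. All function and variable names are hypothetical.

```python
import numpy as np

def placement_heatmap(bg_feats, obj_feats):
    """Correlate object embeddings with every local background feature.

    bg_feats:  (H, W, D) local background features (hypothetical encoder output)
    obj_feats: (S, D) one object embedding per candidate scale
    Returns an (S, H, W) heatmap scoring every location/scale combination.
    """
    # Cosine similarity: normalize both sides, then contract over the
    # feature dimension D for all (scale, location) pairs at once.
    bg = bg_feats / (np.linalg.norm(bg_feats, axis=-1, keepdims=True) + 1e-8)
    ob = obj_feats / (np.linalg.norm(obj_feats, axis=-1, keepdims=True) + 1e-8)
    return np.einsum("hwd,sd->shw", bg, ob)

def sparse_contrastive_loss(heatmap, pos):
    """Cross-entropy over the flattened 3D heatmap with a single annotated
    positive cell (s, y, x); every other cell acts as a negative."""
    s, y, x = pos
    logits = heatmap.reshape(-1)
    m = logits.max()  # stabilize the log-sum-exp
    lse = m + np.log(np.exp(logits - m).sum())
    flat = (s * heatmap.shape[1] + y) * heatmap.shape[2] + x
    return float(lse - logits[flat])
```

Because the heatmap covers all location/scale cells at once, a user-provided location or scale can be honored simply by slicing the tensor before taking the argmax, which matches the interactive-search behavior described above.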
Related papers
- Breaking the Frame: Image Retrieval by Visual Overlap Prediction [53.17564423756082]
We propose a novel visual place recognition approach, VOP, that efficiently addresses occlusions and complex scenes.
The proposed method enables the identification of visible image sections without requiring expensive feature detection and matching.
arXiv Detail & Related papers (2024-06-23T20:00:20Z)
- FoundPose: Unseen Object Pose Estimation with Foundation Features [11.32559845631345]
FoundPose is a model-based method for 6D pose estimation of unseen objects from a single RGB image.
The method can quickly onboard new objects using their 3D models without requiring any object- or task-specific training.
arXiv Detail & Related papers (2023-11-30T18:52:29Z)
- AnyDoor: Zero-shot Object-level Image Customization [63.44307304097742]
This work presents AnyDoor, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations.
Our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage.
arXiv Detail & Related papers (2023-07-18T17:59:02Z)
- Learning-based Relational Object Matching Across Views [63.63338392484501]
We propose a learning-based approach which combines local keypoints with novel object-level features for matching object detections between RGB images.
We train our object-level matching features based on appearance and inter-frame and cross-frame spatial relations between objects in an associative graph neural network.
arXiv Detail & Related papers (2023-05-03T19:36:51Z)
- MeshLoc: Mesh-Based Visual Localization [54.731309449883284]
We explore a more flexible alternative based on dense 3D meshes that does not require feature matching between database images to build the scene representation.
Surprisingly competitive results can be obtained when extracting features on renderings of these meshes, without any neural rendering stage.
Our results show that dense 3D model-based representations are a promising alternative to existing representations and point to interesting and challenging directions for future research.
arXiv Detail & Related papers (2022-07-21T21:21:10Z)
- GALA: Toward Geometry-and-Lighting-Aware Object Search for Compositing [43.14411954867784]
GALA is a generic foreground object search method with discriminative modeling on geometry and lighting compatibility.
It generalizes well on large-scale open-world datasets, i.e. Pixabay and Open Images.
In addition, our method can effectively handle non-box scenarios, where users only provide background images without any input bounding box.
arXiv Detail & Related papers (2022-03-31T22:36:08Z)
- Complex Scene Image Editing by Scene Graph Comprehension [17.72638225034884]
We propose a two-stage method, SGC-Net, for complex scene image editing guided by scene graphs.
In the first stage, we train a Region of Interest (RoI) prediction network that uses scene graphs to predict the locations of the target objects.
The second stage uses a conditional diffusion model to edit the image based on our RoI predictions.
arXiv Detail & Related papers (2022-03-24T05:12:54Z)
- Unsupervised Layered Image Decomposition into Object Prototypes [39.20333694585477]
We present an unsupervised learning framework for decomposing images into layers of automatically discovered object models.
We first validate our approach by providing results on par with the state of the art on standard multi-object synthetic benchmarks.
We then demonstrate the applicability of our model to real images in tasks that include clustering (SVHN, GTSRB), cosegmentation (Weizmann Horse) and object discovery from unfiltered social network images.
arXiv Detail & Related papers (2021-04-29T18:02:01Z)
- Instance Localization for Self-supervised Detection Pretraining [68.24102560821623]
We propose a new self-supervised pretext task, called instance localization.
We show that integration of bounding boxes into pretraining promotes better task alignment and architecture alignment for transfer learning.
Experimental results demonstrate that our approach yields state-of-the-art transfer learning results for object detection.
arXiv Detail & Related papers (2021-02-16T17:58:57Z)
- Contextual Encoder-Decoder Network for Visual Saliency Prediction [42.047816176307066]
We propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task.
We combine the resulting representations with global scene information for accurately predicting visual saliency.
Compared to state-of-the-art approaches, the network is based on a lightweight image classification backbone.
arXiv Detail & Related papers (2019-02-18T16:15:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.