SAC-GAN: Structure-Aware Image-to-Image Composition for Self-Driving
- URL: http://arxiv.org/abs/2112.06596v1
- Date: Mon, 13 Dec 2021 12:24:50 GMT
- Title: SAC-GAN: Structure-Aware Image-to-Image Composition for Self-Driving
- Authors: Hang Zhou, Ali Mahdavi-Amiri, Rui Ma, Hao Zhang
- Abstract summary: We present a compositional approach to image augmentation for self-driving applications.
It is an end-to-end neural network trained to seamlessly compose an object, represented as a cropped patch from an object image, into a background scene image.
We evaluate our network, coined SAC-GAN for structure-aware composition, on prominent self-driving datasets in terms of quality, composability, and generalizability of the composite images.
- Score: 18.842432515507035
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a compositional approach to image augmentation for self-driving applications. It is an end-to-end neural network trained to seamlessly compose an object (e.g., a vehicle or pedestrian), represented as a cropped patch from an object image, into a background scene image. As our approach emphasizes semantic and structural coherence of the composed images rather than their pixel-level RGB accuracy, we tailor the input and output of our network with structure-aware features and design our network losses accordingly. Specifically, our network takes as inputs the semantic layout features of the scene image, features encoded from the edges and silhouette of the object patch, and a latent code, and it generates a 2D spatial affine transform defining the translation and scaling of the object patch. The learned parameters are then fed into a differentiable spatial transformer network that transforms the object patch into the target image, where our model is trained adversarially using an affine transform discriminator and a layout discriminator. We evaluate our network, coined SAC-GAN for structure-aware composition, on prominent self-driving datasets in terms of quality, composability, and generalizability of the composite images. Comparisons to state-of-the-art alternatives confirm the superiority of our method.
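To make the pipeline concrete, the following is a minimal PyTorch sketch of the composition step the abstract describes: a regressor predicts translation and scale from the structure-aware inputs, and a differentiable spatial transformer warps the object patch into the scene. All module names, layer sizes, and the RGBA patch convention are illustrative assumptions rather than the authors' released implementation, and the two adversarial discriminators are omitted.

```python
# Minimal sketch of the SAC-GAN composition step (assumed shapes/names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineRegressor(nn.Module):
    """Predicts translation (tx, ty) and a uniform scale for the object
    patch from scene-layout features, object edge/silhouette features,
    and a latent code (hypothetical feature dimensions)."""
    def __init__(self, feat_dim=256, z_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim + z_dim, 256), nn.ReLU(),
            nn.Linear(256, 3))  # -> (tx, ty, log_s)

    def forward(self, scene_feat, obj_feat, z):
        tx, ty, log_s = self.mlp(
            torch.cat([scene_feat, obj_feat, z], dim=1)).unbind(dim=1)
        s, zero = log_s.exp(), torch.zeros_like(tx)
        # 2x3 affine matrix: uniform scale plus translation. Note that
        # affine_grid maps output to input coordinates, so s > 1 shrinks
        # the rendered patch.
        return torch.stack([torch.stack([s, zero, tx], dim=1),
                            torch.stack([zero, s, ty], dim=1)], dim=1)

def compose(object_patch, scene, theta):
    """Differentiable spatial-transformer step: warp the RGBA object
    patch (alpha channel = silhouette mask) into scene coordinates,
    then alpha-blend it over the background scene."""
    grid = F.affine_grid(theta, scene.size(), align_corners=False)
    warped = F.grid_sample(object_patch, grid, align_corners=False)
    rgb, alpha = warped[:, :3], warped[:, 3:4]
    return alpha * rgb + (1.0 - alpha) * scene
```

Because the warp is differentiable, adversarial losses from the affine transform and layout discriminators can backpropagate through compose() into the regressor.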
Related papers
- SyntStereo2Real: Edge-Aware GAN for Remote Sensing Image-to-Image Translation while Maintaining Stereo Constraint [1.8749305679160366]
Current methods combine two networks: an unpaired image-to-image translation network and a stereo-matching network.
We propose an edge-aware GAN-based network that effectively tackles both tasks simultaneously.
We demonstrate that our model produces qualitatively and quantitatively superior results compared to existing models, and that its applicability extends to diverse domains.
arXiv Detail & Related papers (2024-04-14T14:58:52Z)
- Disentangling Structure and Appearance in ViT Feature Space [26.233355454282446]
We present a method for semantically transferring the visual appearance of one natural image to another.
Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image.
We propose two frameworks for semantic appearance transfer: "Splice", which works by training a generator on a single, arbitrary pair of structure-appearance images, and "SpliceNet", a feed-forward real-time appearance transfer model trained on a dataset of images from a specific domain.
arXiv Detail & Related papers (2023-11-20T21:20:15Z)
- Cross-Image Attention for Zero-Shot Appearance Transfer [68.43651329067393]
We introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images.
We harness three mechanisms that either manipulate the noisy latent codes or the model's internal representations throughout the denoising process.
Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint.
arXiv Detail & Related papers (2023-11-06T18:33:24Z)
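As a rough illustration of the mechanism (not the paper's diffusion pipeline), cross-image attention can be sketched as standard attention whose queries come from the structure image's features while keys and values come from the appearance image's features; all shapes and dimensions below are assumptions.

```python
# Rough sketch of cross-image attention: queries from the structure
# image's features, keys/values from the appearance image's features,
# so each location retrieves appearance from its semantic match.
# Feature shapes and head counts are illustrative assumptions.
import torch
import torch.nn as nn

class CrossImageAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_structure, feat_appearance):
        out, _ = self.attn(query=feat_structure,
                           key=feat_appearance,
                           value=feat_appearance)
        return out

# Example: two 32x32 feature maps flattened to 1024 tokens of dim 256.
f_struct = torch.randn(1, 1024, 256)
f_app = torch.randn(1, 1024, 256)
mixed = CrossImageAttention()(f_struct, f_app)  # -> (1, 1024, 256)
```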
- Spectral Normalization and Dual Contrastive Regularization for Image-to-Image Translation [9.029227024451506]
We propose a new unpaired I2I translation framework based on dual contrastive regularization and spectral normalization.
We conduct comprehensive experiments to evaluate the effectiveness of SN-DCR, and the results show that our method achieves state-of-the-art performance in multiple tasks.
arXiv Detail & Related papers (2023-04-22T05:22:24Z)
- Neural Congealing: Aligning Images to a Joint Semantic Atlas [14.348512536556413]
We present a zero-shot self-supervised framework for aligning semantically-common content across a set of images.
Our approach harnesses the power of pre-trained DINO-ViT features to learn a joint semantic atlas and the mapping from each image to it.
We show that our method performs favorably compared to a state-of-the-art method that requires extensive training on large-scale datasets.
arXiv Detail & Related papers (2023-02-08T09:26:22Z)
- Modeling Image Composition for Complex Scene Generation [77.10533862854706]
We present a method that achieves state-of-the-art results on layout-to-image generation tasks.
After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) for exploring object-to-object, object-to-patch, and patch-to-patch dependencies.
arXiv Detail & Related papers (2022-06-02T08:34:25Z)
- Compositional Sketch Search [91.84489055347585]
We present an algorithm for searching image collections using free-hand sketches.
We exploit drawings as a concise and intuitive representation for specifying entire scene compositions.
arXiv Detail & Related papers (2021-06-15T09:38:09Z)
- Deep Consensus Learning [16.834584070973676]
This paper proposes deep consensus learning for layout-to-image synthesis and weakly-supervised image semantic segmentation.
Two deep consensus mappings are exploited to facilitate training the three networks end-to-end.
It obtains compelling results on both layout-to-image synthesis and weakly-supervised image semantic segmentation.
arXiv Detail & Related papers (2021-03-15T15:51:14Z)
- Category Level Object Pose Estimation via Neural Analysis-by-Synthesis [64.14028598360741]
In this paper we combine a gradient-based fitting procedure with a parametric neural image synthesis module.
The image synthesis network is designed to efficiently span the pose configuration space.
We experimentally show that the method can recover the orientation of objects with high accuracy from 2D images alone.
arXiv Detail & Related papers (2020-08-18T20:30:47Z)
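The core analysis-by-synthesis loop can be sketched as follows: render from a candidate pose with a differentiable synthesis network, measure the discrepancy against the observed image, and update the pose by gradient descent. The `synthesis_net` callable, the 6-D pose parameterization, and the loss below are placeholders, not the paper's exact components.

```python
# Schematic gradient-based analysis-by-synthesis: render an image from a
# candidate pose, compare it to the observation, and descend on the pose.
# `synthesis_net` stands in for any differentiable image generator; the
# axis-angle + translation parameterization is an assumption.
import torch
import torch.nn.functional as F

def fit_pose(synthesis_net, observed, steps=200, lr=1e-2):
    pose = torch.zeros(6, requires_grad=True)   # initial pose guess
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        rendered = synthesis_net(pose)          # differentiable rendering
        loss = F.mse_loss(rendered, observed)   # image-space discrepancy
        loss.backward()                         # gradients flow through the renderer
        opt.step()
    return pose.detach()
```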
- Visual Transformers: Token-based Image Representation and Processing for Computer Vision [67.55770209540306]
The Visual Transformer (VT) operates in a semantic token space, judiciously attending to different image parts based on context.
Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts.
For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
arXiv Detail & Related papers (2020-06-05T20:49:49Z)
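As a loose sketch of the token-space idea, a filter-based tokenizer can pool a spatial feature map into a small set of semantic tokens via softmax attention over positions, which a transformer then processes far more cheaply than the full pixel grid; the sizes below are assumptions, not the paper's configuration.

```python
# Sketch of a filter-based tokenizer: spatial attention maps pool the
# feature map into a few semantic tokens. Sizes are illustrative.
import torch
import torch.nn as nn

class Tokenizer(nn.Module):
    def __init__(self, channels=256, num_tokens=16):
        super().__init__()
        self.token_filters = nn.Linear(channels, num_tokens)

    def forward(self, feat):                   # feat: (B, H*W, C)
        attn = self.token_filters(feat)        # (B, H*W, L) token logits
        attn = attn.softmax(dim=1)             # normalize over positions
        tokens = attn.transpose(1, 2) @ feat   # (B, L, C) weighted pooling
        return tokens

# Example: a 32x32 feature map with 256 channels -> 16 semantic tokens.
tokens = Tokenizer()(torch.randn(1, 1024, 256))  # -> (1, 16, 256)
```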
- Structural-analogy from a Single Image Pair [118.61885732829117]
In this paper, we explore the capabilities of neural networks to understand image structure given only a single pair of images, A and B.
We generate an image that keeps the appearance and style of B, but has a structural arrangement that corresponds to A.
Our method can be used to generate high quality imagery in other conditional generation tasks utilizing images A and B only.
arXiv Detail & Related papers (2020-04-05T14:51:10Z)