Modeling Image Composition for Complex Scene Generation
- URL: http://arxiv.org/abs/2206.00923v1
- Date: Thu, 2 Jun 2022 08:34:25 GMT
- Title: Modeling Image Composition for Complex Scene Generation
- Authors: Zuopeng Yang, Daqing Liu, Chaoyue Wang, Jie Yang, Dacheng Tao
- Abstract summary: We present a method that achieves state-of-the-art results on layout-to-image generation tasks.
After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) to explore object-to-object, object-to-patch, and patch-to-patch dependencies.
- Score: 77.10533862854706
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a method that achieves state-of-the-art results on
challenging (few-shot) layout-to-image generation tasks by accurately modeling
the textures, structures, and relationships contained in a complex scene.
After compressing RGB images into patch tokens, we propose the Transformer
with Focal Attention (TwFA) to explore object-to-object, object-to-patch, and
patch-to-patch dependencies. In contrast to existing CNN-based and
Transformer-based generation models, which entangle pixel-level with
patch-level modeling and object-level with patch-level modeling respectively,
the proposed focal attention predicts the current patch token by attending
only to the highly related tokens specified by the spatial layout, thereby
achieving disambiguation during training. Furthermore, TwFA greatly improves
data efficiency during training, so we propose the first few-shot complex
scene generation strategy built on a well-trained TwFA. Comprehensive
experiments show the superiority of our method, which significantly improves
both quantitative metrics and qualitative visual realism relative to
state-of-the-art CNN-based and Transformer-based methods. Code is available at
https://github.com/JohnDreamer/TwFA.
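The mechanism above amounts to a layout-conditioned attention mask: each patch token attends only to the tokens the spatial layout marks as related, rather than to the full sequence. Below is a minimal sketch of that idea; the function names, tensor shapes, and object-id mask convention are assumptions, not the authors' implementation (see the repository above for the original).

```python
# Minimal sketch of layout-conditioned "focal" attention -- illustrative
# only, not the authors' code (see https://github.com/JohnDreamer/TwFA).
import torch
import torch.nn.functional as F

def layout_focal_mask(patch_obj_ids):
    """patch_obj_ids: (batch, seq) integer object id per patch (0 = background).
    Allow patch-to-patch attention only within the same object region -- a
    simplified stand-in for the paper's object/patch dependency rules."""
    same = patch_obj_ids.unsqueeze(2) == patch_obj_ids.unsqueeze(1)
    eye = torch.eye(patch_obj_ids.size(1), dtype=torch.bool,
                    device=patch_obj_ids.device)
    return (same | eye).unsqueeze(1)          # (batch, 1, seq, seq)

def focal_attention(q, k, v, focal_mask):
    """q, k, v: (batch, heads, seq, dim); focal_mask: (batch, 1, seq, seq),
    True where attention is allowed."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~focal_mask, float("-inf"))
    # For autoregressive generation, a causal mask would additionally be
    # intersected with focal_mask so each patch sees only earlier tokens.
    return F.softmax(scores, dim=-1) @ v
```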
Related papers
- ObjBlur: A Curriculum Learning Approach With Progressive Object-Level Blurring for Improved Layout-to-Image Generation [7.645341879105626]
We present ObjBlur, a novel curriculum learning approach to improve layout-to-image generation models.
Our method is based on progressive object-level blurring, which effectively stabilizes training and enhances the quality of generated images.
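A rough sketch of the progressive blurring idea follows; the linear annealing schedule, shapes, and names are assumptions rather than ObjBlur's actual implementation.

```python
# Hedged sketch of a progressive object-level blurring curriculum: objects
# are heavily blurred early in training and sharpened as training proceeds.
import torchvision.transforms.functional as TF

def blur_objects(image, obj_mask, epoch, total_epochs, max_sigma=5.0):
    """image: (C, H, W) tensor; obj_mask: (1, H, W) with 1.0 inside objects."""
    sigma = max_sigma * (1.0 - epoch / total_epochs)  # anneal blur to zero
    if sigma < 0.1:
        return image                        # late in training: no blurring
    k = 2 * int(3 * sigma) + 1              # odd kernel covering ~3 sigma
    blurred = TF.gaussian_blur(image, kernel_size=k, sigma=sigma)
    # Blurred pixels inside object regions, original pixels elsewhere.
    return blurred * obj_mask + image * (1.0 - obj_mask)
```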
arXiv Detail & Related papers (2024-04-11T08:50:12Z)
- Fiducial Focus Augmentation for Facial Landmark Detection [4.433764381081446]
We propose a novel image augmentation technique to enhance the model's understanding of facial structures.
We employ a Siamese architecture-based training mechanism with a Deep Canonical Correlation Analysis (DCCA)-based loss.
Our approach outperforms multiple state-of-the-art approaches across various benchmark datasets.
arXiv Detail & Related papers (2024-02-23T01:34:00Z)
- Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on a Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
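As a rough illustration, one plausible (assumed) form of distance-based weighting penalizes attention logits by the spatial distance between patch centers, so nearby patches contribute more; this is illustrative only, not the paper's exact formulation.

```python
# Assumed form of distance-weighted attention; not the DWT paper's code.
import torch
import torch.nn.functional as F

def distance_weighted_attention(q, k, v, coords, alpha=0.1):
    """q, k, v: (batch, heads, seq, dim); coords: (seq, 2) float patch centers."""
    dist = torch.cdist(coords, coords)                      # (seq, seq)
    logits = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5 - alpha * dist
    return F.softmax(logits, dim=-1) @ v
```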
arXiv Detail & Related papers (2023-10-11T12:46:11Z)
- DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation in the optimal number of tokens each position should attend to.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
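One hypothetical reading of such a unit is a per-query top-k selection over attention scores; the fixed top_k below stands in for the adaptively chosen token budget and is an assumption, not DynaST's code.

```python
# Simplified per-query top-k sparse attention; not DynaST's implementation.
import torch
import torch.nn.functional as F

def dynamic_sparse_attention(q, k, v, top_k=8):
    """q, k, v: (batch, heads, seq, dim); requires top_k <= seq."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    kth_best = scores.topk(top_k, dim=-1).values[..., -1:]  # k-th best score
    sparse = scores.masked_fill(scores < kth_best, float("-inf"))
    return F.softmax(sparse, dim=-1) @ v
```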
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
- MAT: Mask-Aware Transformer for Large Hole Image Inpainting [79.67039090195527]
We present a novel model for large hole inpainting, which unifies the merits of transformers and convolutions.
Experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets.
arXiv Detail & Related papers (2022-03-29T06:36:17Z)
- Controllable Person Image Synthesis with Spatially-Adaptive Warped Normalization [72.65828901909708]
Controllable person image generation aims to produce realistic human images with desirable attributes.
We introduce a novel Spatially-Adaptive Warped Normalization (SAWN), which integrates a learned flow-field to warp modulation parameters.
We propose a novel self-training part replacement strategy to refine the pretrained model for the texture-transfer task.
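A minimal sketch of warping modulation parameters with a learned flow field, in the spirit of the SAWN summary above; shapes, names, and the modulation form are assumptions, not the paper's implementation.

```python
# Warping spatially-adaptive modulation maps with a learned sampling grid.
import torch.nn.functional as F

def warped_modulation(feat, gamma, beta, grid):
    """feat: (B, C, H, W) normalized features; gamma, beta: (B, C, H, W)
    modulation maps; grid: (B, H, W, 2) sampling grid in [-1, 1]."""
    g = F.grid_sample(gamma, grid, align_corners=False)     # warped scale
    b = F.grid_sample(beta, grid, align_corners=False)      # warped shift
    return feat * (1.0 + g) + b
```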
arXiv Detail & Related papers (2021-05-31T07:07:44Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- Foreground-aware Semantic Representations for Image Harmonization [5.156484100374058]
We propose a novel architecture to utilize the space of high-level features learned by a pre-trained classification network.
We extensively evaluate the proposed method on the existing image harmonization benchmark and set a new state of the art in terms of MSE and PSNR metrics.
arXiv Detail & Related papers (2020-06-01T09:27:20Z)