Semantic Image Synthesis via Class-Adaptive Cross-Attention
- URL: http://arxiv.org/abs/2308.16071v3
- Date: Tue, 30 Jul 2024 13:09:47 GMT
- Title: Semantic Image Synthesis via Class-Adaptive Cross-Attention
- Authors: Tomaso Fontanini, Claudio Ferrari, Giuseppe Lisanti, Massimo Bertozzi, Andrea Prati
- Abstract summary: Cross-attention layers are used in place of SPADE to learn shape-style correlations and thereby condition the image generation process.
Our model inherits the versatility of SPADE while obtaining state-of-the-art generation quality and improved global and local style transfer.
- Score: 7.147779225315707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In semantic image synthesis the state of the art is dominated by methods that use customized variants of the SPatially-Adaptive DE-normalization (SPADE) layers, which allow for good visual generation quality and editing versatility. By design, such layers learn pixel-wise modulation parameters to de-normalize the generator activations based on the semantic class each pixel belongs to. Thus, they tend to overlook global image statistics, ultimately leading to unconvincing local style editing and causing global inconsistencies such as color or illumination distribution shifts. Also, SPADE layers require the semantic segmentation mask for mapping styles in the generator, preventing shape manipulations without manual intervention. In response, we designed a novel architecture where cross-attention layers are used in place of SPADE to learn shape-style correlations and thereby condition the image generation process. Our model inherits the versatility of SPADE while obtaining state-of-the-art generation quality and improved global and local style transfer. Code and models available at https://github.com/TFonta/CA2SIS.
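To make the contrast concrete, below is a minimal, illustrative PyTorch sketch of the two conditioning mechanisms the abstract compares: a SPADE-style block that predicts pixel-wise modulation parameters from the segmentation mask, and a cross-attention block in which spatial generator features attend to per-class style tokens. This is a hedged sketch under assumed shapes, not the authors' CA2SIS implementation (see their repository for the real code); all class and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADEBlock(nn.Module):
    """SPADE-style de-normalization: per-pixel gamma/beta are predicted
    from the one-hot semantic mask (illustrative, not the paper's code)."""
    def __init__(self, channels: int, num_classes: int, hidden: int = 128):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels, affine=False)
        self.shared = nn.Conv2d(num_classes, hidden, 3, padding=1)
        self.gamma = nn.Conv2d(hidden, channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, mask):
        # mask: (B, num_classes, H, W) one-hot semantic layout
        m = F.interpolate(mask, size=x.shape[2:], mode="nearest")
        h = F.relu(self.shared(m))
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)

class CrossAttentionCondition(nn.Module):
    """Cross-attention conditioning: spatial features act as queries over
    per-class style tokens, tying style to shape without per-pixel
    modulation parameters. channels must be divisible by heads."""
    def __init__(self, channels: int, style_dim: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, kdim=style_dim,
                                          vdim=style_dim, batch_first=True)

    def forward(self, x, style_tokens):
        # x: (B, C, H, W); style_tokens: (B, num_classes, style_dim)
        B, C, H, W = x.shape
        q = x.flatten(2).transpose(1, 2)  # (B, H*W, C) query tokens
        out, _ = self.attn(self.norm(q), style_tokens, style_tokens)
        return (q + out).transpose(1, 2).reshape(B, C, H, W)
```

Note how, in the second block, the style tokens rather than the mask itself drive the modulation, which matches the abstract's point that cross-attention relaxes the dependence on the segmentation mask when mapping styles.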
Related papers
- Stable Flow: Vital Layers for Training-Free Image Editing [74.52248787189302]
Diffusion models have revolutionized the field of content synthesis and editing.
Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT).
We propose an automatic method to identify "vital layers" within DiT, crucial for image formation.
Next, to enable real-image editing, we introduce an improved image inversion method for flow models.
arXiv Detail & Related papers (2024-11-21T18:59:51Z)
- LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model [70.14953942532621]
Layer-collaborative diffusion model, named LayerDiff, is designed for text-guided, multi-layered, composable image synthesis.
Our model can generate high-quality multi-layered images with performance comparable to conventional whole-image generation methods.
LayerDiff enables a broader range of controllable generative applications, including layer-specific image editing and style transfer.
arXiv Detail & Related papers (2024-03-18T16:28:28Z)
- Latents2Semantics: Leveraging the Latent Space of Generative Models for Localized Style Manipulation of Face Images [25.82631308991067]
We introduce the Latents2Semantics Autoencoder (L2SAE), a Generative Autoencoder model that facilitates localized editing of style attributes of several Regions of Interest in face images.
The L2SAE learns separate latent representations for encoded images' structure and style information, allowing for structure-preserving style editing of the chosen ROIs.
We provide qualitative and quantitative results for multiple applications, such as selective style editing and swapping, using test images sampled from several datasets.
arXiv Detail & Related papers (2023-12-22T20:06:53Z)
- Efficient and Explicit Modelling of Image Hierarchies for Image Restoration [120.35246456398738]
We propose a mechanism to efficiently and explicitly model image hierarchies in the global, regional, and local range for image restoration.
Inspired by this, we propose anchored stripe self-attention, which achieves a good balance between the space and time complexity of self-attention (a simplified sketch follows this entry).
Then we propose a new network architecture dubbed GRL to explicitly model image hierarchies in the Global, Regional, and Local range.
arXiv Detail & Related papers (2023-03-01T18:59:29Z)
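The stripe self-attention named in the GRL entry above restricts attention to stripes of the feature map. The sketch below shows only the plain stripe-attention part (the anchoring trick that further reduces cost is omitted); shapes and names are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class StripeSelfAttention(nn.Module):
    """Self-attention computed independently within horizontal stripes,
    so cost scales with stripe size rather than full image size."""
    def __init__(self, channels: int, stripe_height: int = 8, heads: int = 4):
        super().__init__()
        self.stripe_height = stripe_height
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):
        # x: (B, C, H, W); H assumed divisible by stripe_height
        B, C, H, W = x.shape
        s = self.stripe_height
        # split into horizontal stripes, attend within each stripe
        x = x.reshape(B, C, H // s, s, W)
        x = x.permute(0, 2, 3, 4, 1).reshape(B * (H // s), s * W, C)
        out, _ = self.attn(x, x, x)
        out = out.reshape(B, H // s, s, W, C).permute(0, 4, 1, 2, 3)
        return out.reshape(B, C, H, W)
```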
- Style-Hallucinated Dual Consistency Learning: A Unified Framework for Visual Domain Generalization [113.03189252044773]
We propose a unified framework, Style-HAllucinated Dual consistEncy learning (SHADE), to handle domain shift in various visual tasks.
Our versatile SHADE can significantly enhance the generalization in various visual recognition tasks, including image classification, semantic segmentation and object detection.
arXiv Detail & Related papers (2022-12-18T11:42:51Z)
- Hierarchical Semantic Regularization of Latent Spaces in StyleGANs [53.98170188547775]
We propose a Hierarchical Semantic Regularizer (HSR) which aligns the hierarchical representations learnt by the generator to corresponding powerful features learnt by pretrained networks on large amounts of data.
HSR is shown to not only improve generator representations but also the linearity and smoothness of the latent style spaces, leading to the generation of more natural-looking style-edited images.
arXiv Detail & Related papers (2022-08-07T16:23:33Z)
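The HSR entry above aligns generator features with features from a pretrained network. A minimal sketch of such a hierarchical feature-alignment loss, assuming a frozen extractor supplies per-level target features; all names are illustrative, not the paper's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignmentRegularizer(nn.Module):
    """Per-level L2 alignment of generator features to frozen target features."""
    def __init__(self, gen_dims: list, feat_dims: list):
        super().__init__()
        # 1x1 projections map generator features into the pretrained space
        self.proj = nn.ModuleList(
            nn.Conv2d(g, f, kernel_size=1) for g, f in zip(gen_dims, feat_dims)
        )

    def forward(self, gen_feats, target_feats):
        # gen_feats / target_feats: lists of (B, C_i, H_i, W_i) tensors,
        # one per level of the hierarchy; targets come from a frozen network
        loss = 0.0
        for proj, g, t in zip(self.proj, gen_feats, target_feats):
            g = proj(g)
            if g.shape[2:] != t.shape[2:]:
                g = F.interpolate(g, size=t.shape[2:], mode="bilinear",
                                  align_corners=False)
            loss = loss + F.mse_loss(g, t.detach())
        return loss
```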
- V-LinkNet: Learning Contextual Inpainting Across Latent Space of Generative Adversarial Network [7.5089719291325325]
We propose the V-LinkNet cross-space learning strategy network to improve learning on contextualised features.
We compare inpainting performance on the same face with different masks and on different faces with the same masks.
Our results surpass the state of the art when evaluated on CelebA-HQ with the standard protocol.
arXiv Detail & Related papers (2022-01-02T09:14:23Z)
- One-shot domain adaptation for semantic face editing of real world images using StyleALAE [7.541747299649292]
StyleALAE is a latent-space-based autoencoder that can generate high-quality, photo-realistic images.
Our work ensures that the identity of the reconstructed image is the same as the given input image.
We further generate semantic modifications of the reconstructed image by using the latent space of the pre-trained StyleALAE model (see the sketch after this entry).
arXiv Detail & Related papers (2021-08-31T14:32:18Z)
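Semantic editing in a pretrained autoencoder's latent space, as in the StyleALAE entry above, commonly amounts to moving the latent code along an attribute direction. A hedged sketch assuming such a direction vector is given (e.g., from an attribute classifier); this is a generic pattern, not the paper's exact procedure.

```python
import torch

def edit_latent(w: torch.Tensor, direction: torch.Tensor,
                strength: float = 1.5) -> torch.Tensor:
    """Move a latent code along a (hypothetical) semantic direction."""
    return w + strength * direction / direction.norm()

# Illustrative usage with assumed encoder/generator handles:
#   w = encoder(image)
#   edited = generator(edit_latent(w, smile_direction))
```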
- Diversifying Semantic Image Synthesis and Editing via Class- and Layer-wise VAEs [8.528384027684192]
We propose a class- and layer-wise extension to the variational autoencoder framework that allows flexible control over each object class at the local to global levels.
We demonstrate that our method generates images that are both plausible and more diverse compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-06-25T04:12:05Z)
- Controllable Person Image Synthesis with Spatially-Adaptive Warped Normalization [72.65828901909708]
Controllable person image generation aims to produce realistic human images with desirable attributes.
We introduce a novel Spatially-Adaptive Warped Normalization (SAWN), which integrates a learned flow-field to warp modulation parameters.
We propose a novel self-training part replacement strategy to refine the pretrained model for the texture-transfer task.
arXiv Detail & Related papers (2021-05-31T07:07:44Z)
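The SAWN entry above integrates a learned flow field to warp modulation parameters. A hedged sketch of that idea using grid_sample, assuming the flow is predicted elsewhere in normalized coordinates; names and the exact normalization choice are assumptions, not the authors' formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WarpedModulation(nn.Module):
    """De-normalization whose gamma/beta maps are warped by a flow field."""
    def __init__(self, channels: int, cond_channels: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.gamma = nn.Conv2d(cond_channels, channels, 3, padding=1)
        self.beta = nn.Conv2d(cond_channels, channels, 3, padding=1)

    def forward(self, x, cond, flow):
        # x: (B, C, H, W) activations; cond: (B, cond_channels, H, W)
        # flow: (B, 2, H, W) learned offsets in normalized [-1, 1] coords
        B, _, H, W = x.shape
        # base sampling grid in normalized coordinates
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=x.device),
            torch.linspace(-1, 1, W, device=x.device),
            indexing="ij",
        )
        grid = torch.stack((xs, ys), dim=-1).expand(B, H, W, 2) \
               + flow.permute(0, 2, 3, 1)
        # warp predicted modulation maps to the target pose/shape
        gamma = F.grid_sample(self.gamma(cond), grid, align_corners=False)
        beta = F.grid_sample(self.beta(cond), grid, align_corners=False)
        return self.norm(x) * (1 + gamma) + beta
```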