Patch-enhanced Mask Encoder Prompt Image Generation
- URL: http://arxiv.org/abs/2405.19085v1
- Date: Wed, 29 May 2024 13:47:32 GMT
- Title: Patch-enhanced Mask Encoder Prompt Image Generation
- Authors: Shusong Xu, Peiye Liu
- Abstract summary: We propose a patch-enhanced mask encoder approach to ensure accurate product descriptions.
Our approach consists of three components: Patch Flexible Visibility, a Mask Encoder Prompt Adapter, and an image Foundation Model.
Experimental results show our method achieves the best visual quality and FID scores compared with other methods.
- Score: 0.8747606955991707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Artificial Intelligence Generated Content (AIGC), known for its superior visual results, is a promising way to mitigate the high cost of advertising applications. Numerous approaches have been developed to manipulate generated content under different conditions. However, a crucial limitation lies in accurately depicting products in advertising applications: applying previous methods directly can cause considerable distortion and deformation of the advertised products, primarily because their content control conditions are oversimplified. Hence, in this work, we propose a patch-enhanced mask encoder approach that ensures accurate product descriptions while preserving diverse backgrounds. Our approach consists of three components: Patch Flexible Visibility, a Mask Encoder Prompt Adapter, and an image Foundation Model. Patch Flexible Visibility generates a more reasonable background image, and the Mask Encoder Prompt Adapter enables region-controlled fusion. We also analyze the structure and operational mechanisms of the Generation Module. Experimental results show our method achieves the best visual quality and FID scores compared with other methods.
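As a rough illustration of the region-controlled fusion described in the abstract, the Python sketch below encodes a binary product mask into extra prompt tokens and concatenates them with text tokens for a cross-attention backbone. All module names, shapes, and layer choices here are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): a mask encoder that turns a
# binary product mask into extra prompt tokens, concatenated with the text
# tokens so the backbone's cross-attention sees both conditions.
import torch
import torch.nn as nn

class MaskEncoderPromptAdapter(nn.Module):
    def __init__(self, token_dim=768, grid_size=4):
        super().__init__()
        # Small CNN that downsamples the 1-channel mask to a grid_size x
        # grid_size feature map, i.e. grid_size**2 mask tokens.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(grid_size),
        )
        self.proj = nn.Linear(64, token_dim)  # lift to the prompt width

    def forward(self, mask, text_tokens):
        # mask: (B, 1, H, W) in {0, 1}; text_tokens: (B, T, token_dim)
        feat = self.encoder(mask)                                 # (B, 64, g, g)
        mask_tokens = self.proj(feat.flatten(2).transpose(1, 2))  # (B, g*g, D)
        return torch.cat([text_tokens, mask_tokens], dim=1)

adapter = MaskEncoderPromptAdapter()
fused = adapter(torch.ones(2, 1, 64, 64), torch.randn(2, 77, 768))
print(fused.shape)  # torch.Size([2, 93, 768])
```

Keeping the mask tokens separate from the text tokens lets the cross-attention layers weight the product region and the textual background prompt independently, which is one plausible reading of "region-controlled fusion".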
Related papers
- Towards Fine-grained Interactive Segmentation in Images and Videos [21.22536962888316]
We present the SAM2Refiner framework, built upon the SAM2 backbone.
This architecture allows SAM2 to generate fine-grained segmentation masks for both images and videos.
In addition, a mask refinement module employs a multi-scale cascaded structure to fuse mask features with hierarchical representations from the encoder, as sketched below.
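A minimal sketch of that multi-scale cascaded fusion under assumed channel counts; this is not SAM2Refiner's actual module.

```python
# Coarse mask features are repeatedly upsampled and fused with progressively
# finer encoder features, then projected to refined mask logits.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadedMaskRefiner(nn.Module):
    def __init__(self, enc_channels=(256, 128, 64), mask_channels=256):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = mask_channels
        for ec in enc_channels:  # coarse -> fine encoder levels
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch + ec, ec, 3, padding=1), nn.GELU()))
            in_ch = ec
        self.head = nn.Conv2d(enc_channels[-1], 1, 1)  # final mask logits

    def forward(self, mask_feat, enc_feats):
        # mask_feat: coarse mask features; enc_feats: coarse-to-fine hierarchy.
        x = mask_feat
        for stage, skip in zip(self.stages, enc_feats):
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear",
                              align_corners=False)    # upsample to this level
            x = stage(torch.cat([x, skip], dim=1))    # fuse with encoder feat
        return self.head(x)
```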
arXiv Detail & Related papers (2025-02-12T06:38:18Z)
- ForgeryGPT: Multimodal Large Language Model for Explainable Image Forgery Detection and Localization [49.992614129625274]
ForgeryGPT is a novel framework that advances the Image Forgery Detection and Localization task.
It captures high-order correlations of forged images from diverse linguistic feature spaces.
It enables explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture.
arXiv Detail & Related papers (2024-10-14T07:56:51Z)
- FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models [16.737419222106308]
We propose the explainable image forgery detection and localization (IFDL) task and design FakeShield.
FakeShield is a multi-modal framework capable of evaluating image authenticity, generating tampered region masks, and providing a judgment basis based on pixel-level and image-level tampering clues.
In experiments, FakeShield effectively detects and localizes various tampering techniques, offering an explainable and superior solution compared to previous IFDL methods.
arXiv Detail & Related papers (2024-10-03T17:59:34Z)
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
However, in 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- DiffuseGAE: Controllable and High-fidelity Image Manipulation from Disentangled Representation [14.725538019917625]
Diffusion probabilistic models (DPMs) have shown remarkable results on various image synthesis tasks.
However, DPMs lack a low-dimensional, interpretable, and well-decoupled latent code.
We propose Diff-AE to explore the potential of DPMs for representation learning via autoencoding.
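A toy sketch of the diffusion-autoencoder idea referenced here: a semantic encoder maps the clean image to a compact code z, and the noise predictor is conditioned on z as well as the timestep. Shapes and modules are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TinyDiffAE(nn.Module):
    def __init__(self, img_dim=3 * 32 * 32, z_dim=64):
        super().__init__()
        self.sem_encoder = nn.Sequential(   # x0 -> low-dim semantic code z
            nn.Linear(img_dim, 256), nn.GELU(), nn.Linear(256, z_dim))
        self.denoiser = nn.Sequential(      # (x_t, z, t) -> noise estimate
            nn.Linear(img_dim + z_dim + 1, 256), nn.GELU(),
            nn.Linear(256, img_dim))

    def forward(self, x0, xt, t):
        z = self.sem_encoder(x0.flatten(1))             # decoupled latent code
        inp = torch.cat([xt.flatten(1), z, t.float()[:, None]], dim=1)
        return self.denoiser(inp).view_as(xt)           # predicted noise
```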
arXiv Detail & Related papers (2023-07-12T04:11:08Z)
- Conditioning Diffusion Models via Attributes and Semantic Masks for Face Generation [1.104121146441257]
Deep generative models have shown impressive results in generating realistic images of faces.
GANs have managed to generate high-quality, high-fidelity images when conditioned on semantic masks, but they still lack the ability to diversify their output.
We propose a multi-conditioning approach for diffusion models via cross-attention exploiting both attributes and semantic masks to generate high-quality and controllable face images.
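The sketch below illustrates how attributes and a semantic mask could be embedded as token sequences and fused through a single cross-attention call; the dimensions and embedding choices are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class MultiCondCrossAttention(nn.Module):
    def __init__(self, dim=320, n_attrs=40, mask_classes=19, heads=8):
        super().__init__()
        self.attr_embed = nn.Embedding(n_attrs, dim)    # one token per attribute
        self.mask_embed = nn.Conv2d(mask_classes, dim, 4, stride=4)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, attrs, sem_mask):
        # x: (B, N, dim) UNet tokens; attrs: (B, n_attrs) float in {0, 1};
        # sem_mask: (B, mask_classes, H, W) one-hot semantic layout.
        attr_tok = self.attr_embed.weight[None] * attrs[..., None]  # (B, A, dim)
        mask_tok = self.mask_embed(sem_mask).flatten(2).transpose(1, 2)
        cond = torch.cat([attr_tok, mask_tok], dim=1)    # joint condition
        out, _ = self.attn(query=x, key=cond, value=cond)  # cross-attention
        return x + out                                   # residual update
```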
arXiv Detail & Related papers (2023-06-01T17:16:37Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
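A small sketch of the straight-through Gumbel-Softmax trick the summary attributes to AutoMAE's mask generator; the per-patch logits and the absence of an explicit masking-ratio constraint are simplifications.

```python
import torch
import torch.nn.functional as F

def sample_patch_mask(logits, tau=1.0):
    """logits: (B, N, 2) per-patch [keep, mask] scores from a mask generator."""
    # Hard 0/1 samples in the forward pass, soft gradients in the backward
    # pass, so the adversarially trained mask generator stays differentiable.
    onehot = F.gumbel_softmax(logits, tau=tau, hard=True)  # (B, N, 2)
    return onehot[..., 1]                                  # 1 = masked patch

logits = torch.randn(4, 196, 2, requires_grad=True)  # 14x14 ViT patch grid
mask = sample_patch_mask(logits)
mask.sum().backward()  # gradients reach the mask generator's logits
print(mask.shape, logits.grad is not None)  # torch.Size([4, 196]) True
```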
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- MaskSketch: Unpaired Structure-guided Masked Image Generation [56.88038469743742]
MaskSketch is an image generation method that allows spatial conditioning of the result by using a guiding sketch as an extra signal during sampling.
We show that intermediate self-attention maps of a masked generative transformer encode important structural information of the input image.
Our results show that MaskSketch achieves high image realism and fidelity to the guiding structure.
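The following sketch shows one plausible way to score structural fidelity by comparing self-attention maps of the guiding sketch and a sampled candidate; the cosine distance is an assumption, not necessarily MaskSketch's exact metric.

```python
import torch
import torch.nn.functional as F

def structure_distance(attn_sketch, attn_sample):
    # attn_*: (L, H, N, N) self-attention maps from L layers and H heads.
    a = F.normalize(attn_sketch.flatten(0, 1).flatten(1), dim=1)  # (L*H, N*N)
    b = F.normalize(attn_sample.flatten(0, 1).flatten(1), dim=1)
    return (1.0 - (a * b).sum(dim=1)).mean()  # mean per-head cosine distance

# During sampling, candidates whose attention maps stay close to the guiding
# sketch's maps (lower distance) are preferred for structural fidelity.
```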
arXiv Detail & Related papers (2023-02-10T20:27:02Z)
- MAT: Mask-Aware Transformer for Large Hole Image Inpainting [79.67039090195527]
We present a novel model for large-hole inpainting, which unifies the merits of transformers and convolutions.
Experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets.
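A hedged sketch of mask-aware attention for inpainting: hole tokens are excluded from the keys and values so features are aggregated from valid pixels only. This mirrors the idea in the summary rather than MAT's exact implementation.

```python
import torch
import torch.nn as nn

class MaskAwareAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, valid):
        # tokens: (B, N, dim); valid: (B, N) bool, True where pixels are known.
        out, _ = self.attn(tokens, tokens, tokens,
                           key_padding_mask=~valid)  # attend to valid only
        return out
```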
arXiv Detail & Related papers (2022-03-29T06:36:17Z)
- MagGAN: High-Resolution Face Attribute Editing with Mask-Guided Generative Adversarial Network [145.4591079418917]
MagGAN learns to only edit the facial parts that are relevant to the desired attribute changes.
A novel mask-guided conditioning strategy is introduced to incorporate the influence region of each attribute change into the generator.
A multi-level patch-wise discriminator structure is proposed to scale our model to high-resolution ($1024 \times 1024$) face editing.
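As a minimal illustration of mask-guided conditioning, the sketch below confines an edit branch's output to the attribute's influence region; the function and its inputs are hypothetical, not MagGAN's code.

```python
import torch

def mask_guided_fuse(base_feat, edit_feat, influence_mask):
    # base_feat, edit_feat: (B, C, H, W); influence_mask: (B, 1, H, W) in
    # [0, 1], a soft region relevant to the requested attribute change.
    return influence_mask * edit_feat + (1.0 - influence_mask) * base_feat
```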
arXiv Detail & Related papers (2020-10-03T20:56:16Z)