Training-Free Style Consistent Image Synthesis with Condition and Mask Guidance in E-Commerce
- URL: http://arxiv.org/abs/2409.04750v1
- Date: Sat, 7 Sep 2024 07:50:13 GMT
- Title: Training-Free Style Consistent Image Synthesis with Condition and Mask Guidance in E-Commerce
- Authors: Guandong Li
- Abstract summary: We introduce the concept of the QKV level, referring to modifications in the attention maps (self-attention and cross-attention) when integrating UNet with image conditions.
We use shared KV to enhance similarity in cross-attention and generate mask guidance from the attention map to cleverly direct the generation of style-consistent images.
- Score: 13.67619785783182
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating style-consistent images is a common task in e-commerce, and current methods are largely based on diffusion models, which have achieved excellent results. This paper introduces the concept of the QKV (query/key/value) level, referring to modifications in the attention maps (self-attention and cross-attention) when integrating the UNet with image conditions. Without disrupting the product's main composition in e-commerce images, we aim to guide generation with a training-free method driven by pre-set conditions. This involves using shared KV to enhance similarity in cross-attention and generating mask guidance from the attention maps to direct the generation of style-consistent images. Our method has shown promising results in practical applications.
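The abstract does not come with code, but its two ingredients can be illustrated concretely. Below is a minimal PyTorch sketch, assuming a standard UNet attention layer in which queries come from the image being generated, keys/values are shared from the reference (condition) branch, and the resulting attention map is thresholded into a mask for the product region. The function names, the `product_idx` token indices, and the thresholding step are illustrative assumptions, not the authors' implementation.

```python
import torch


def shared_kv_attention(q_gen, k_ref, v_ref, num_heads=8):
    """Attention in which the queries of the generated branch attend to keys
    and values shared from the reference (condition) branch, pulling the
    output toward the reference style. Returns the output features and the
    head-averaged attention map for later mask guidance."""
    b, n_q, d = q_gen.shape
    head_dim = d // num_heads

    def split_heads(x):
        return x.view(b, -1, num_heads, head_dim).transpose(1, 2)

    q, k, v = split_heads(q_gen), split_heads(k_ref), split_heads(v_ref)
    attn = torch.softmax(q @ k.transpose(-2, -1) * head_dim ** -0.5, dim=-1)
    out = (attn @ v).transpose(1, 2).reshape(b, n_q, d)
    return out, attn.mean(dim=1)                               # (b, n_q, d), (b, n_q, n_ref)


def mask_from_attention(attn_map, product_idx, spatial_hw, threshold=0.5):
    """Threshold the attention received by product-related reference tokens
    (product_idx: hypothetical list of token indices) into a binary mask."""
    h, w = spatial_hw
    scores = attn_map[..., product_idx].mean(dim=-1)           # (b, h*w)
    lo = scores.amin(dim=-1, keepdim=True)
    hi = scores.amax(dim=-1, keepdim=True)
    scores = (scores - lo) / (hi - lo + 1e-8)                  # normalize to [0, 1]
    return (scores > threshold).float().view(-1, 1, h, w)


# One plausible way to apply the mask during denoising: keep the product
# region from the condition branch and let the background follow the
# style-guided branch.
# latents = mask * latents_condition + (1 - mask) * latents_style
```

The commented blending step at the end shows one way such a mask could protect the product's composition while the surrounding scene is restyled; how the paper actually applies the guidance is not specified in the abstract.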
Related papers
- Edicho: Consistent Image Editing in the Wild [90.42395533938915]
Edicho steps in with a training-free solution based on diffusion models.
It features a fundamental design principle of using explicit image correspondence to direct editing.
arXiv Detail & Related papers (2024-12-30T16:56:44Z)
- VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control [8.685610154314459]
Diffusion models show extraordinary talents in text-to-image generation, but they may still fail to generate highly aesthetic images.
We propose the Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter.
Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method.
arXiv Detail & Related papers (2024-12-30T08:47:25Z)
- Enhancing Conditional Image Generation with Explainable Latent Space Manipulation [0.0]
This paper proposes a novel approach to achieve fidelity to a reference image while adhering to conditional prompts.
We analyze the cross-attention maps and the gradients of the denoised latent vector.
Using this information, we create masks at specific timesteps during denoising to preserve subjects while seamlessly integrating the reference image features.
arXiv Detail & Related papers (2024-08-29T03:12:04Z)
- ZePo: Zero-Shot Portrait Stylization with Faster Sampling [61.14140480095604]
This paper presents an inversion-free portrait stylization framework based on diffusion models that accomplishes content and style feature fusion in merely four sampling steps.
We propose a feature merging strategy to amalgamate redundant features in Consistency Features, thereby reducing the computational load of attention control.
arXiv Detail & Related papers (2024-08-10T08:53:41Z)
- CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
Our method outperforms state-of-the-art methods by a large margin with significantly less training time.
arXiv Detail & Related papers (2024-02-29T15:05:11Z)
- Visual Concept-driven Image Generation with Text-to-Image Diffusion Model [65.96212844602866]
Text-to-image (TTI) models have demonstrated impressive results in generating high-resolution images of complex scenes.
Recent approaches have extended these methods with personalization techniques that allow them to integrate user-illustrated concepts.
However, the ability to generate images with multiple interacting concepts, such as human subjects, as well as concepts that may be entangled in one, or across multiple, image illustrations remains elusive.
We propose a concept-driven TTI personalization framework that addresses these core challenges.
arXiv Detail & Related papers (2024-02-18T07:28:37Z)
- Cross-Image Attention for Zero-Shot Appearance Transfer [68.43651329067393]
We introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images.
We harness three mechanisms that either manipulate the noisy latent codes or the model's internal representations throughout the denoising process.
Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint.
arXiv Detail & Related papers (2023-11-06T18:33:24Z)
- A Unified Arbitrary Style Transfer Framework via Adaptive Contrastive Learning [84.8813842101747]
Unified Contrastive Arbitrary Style Transfer (UCAST) is a novel style representation learning and transfer framework.
We present an adaptive contrastive learning scheme for style transfer by introducing an input-dependent temperature (a minimal sketch of this idea appears after this list).
Our framework consists of three key components, i.e., a parallel contrastive learning scheme for style representation and style transfer, a domain enhancement module for effective learning of style distribution, and a generative network for style transfer.
arXiv Detail & Related papers (2023-03-09T04:35:00Z)
- Collaborative Image Understanding [5.5174379874002435]
We show that collaborative information can be leveraged to improve the classification process of new images.
A series of experiments on datasets from e-commerce and social media demonstrates that considering collaborative signals helps to significantly improve the performance of the main task of image classification by up to 9.1%.
arXiv Detail & Related papers (2022-10-21T12:13:08Z)
- SCS-Co: Self-Consistent Style Contrastive Learning for Image Harmonization [29.600429707123645]
We propose a self-consistent style contrastive learning scheme (SCS-Co) for image harmonization.
By dynamically generating multiple negative samples, our SCS-Co can learn more distortion knowledge and better regularize the generated harmonized image.
In addition, we propose a background-attentional adaptive instance normalization (BAIN) to achieve an attention-weighted background feature distribution.
arXiv Detail & Related papers (2022-04-29T09:22:01Z)
- Co-Attention for Conditioned Image Matching [91.43244337264454]
We propose a new approach to determine correspondences between image pairs in the wild under large changes in illumination, viewpoint, context, and material.
While other approaches find correspondences between pairs of images by treating the images independently, we instead condition on both images to implicitly take account of the differences between them.
arXiv Detail & Related papers (2020-07-16T17:32:00Z)
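The UCAST entry above mentions an input-dependent temperature for its contrastive style objective. The sketch below shows one plausible reading of that idea: a standard InfoNCE loss whose temperature is predicted per sample from the anchor's style feature. The `temp_net` head, the feature shapes, and the cosine-similarity choice are assumptions for illustration, not that paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveTemperatureContrast(nn.Module):
    """InfoNCE-style loss with a temperature predicted from the anchor's
    style feature (an input-dependent temperature)."""

    def __init__(self, feat_dim=256):
        super().__init__()
        # Small hypothetical head mapping a style feature to a temperature.
        self.temp_net = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, anchor, positive, negatives):
        # anchor, positive: (b, d); negatives: (b, n, d)
        tau = F.softplus(self.temp_net(anchor)) + 1e-2                      # (b, 1), kept positive
        pos = F.cosine_similarity(anchor, positive, dim=-1).unsqueeze(1)    # (b, 1)
        neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1)   # (b, n)
        logits = torch.cat([pos, neg], dim=1) / tau
        labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
        return F.cross_entropy(logits, labels)  # the positive pair sits at index 0
```

In this reading, a larger predicted temperature softens the contrast for ambiguous style inputs, while a smaller one sharpens it for distinctive ones.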
This list is automatically generated from the titles and abstracts of the papers on this site.