PiCo: Enhancing Text-Image Alignment with Improved Noise Selection and Precise Mask Control in Diffusion Models
- URL: http://arxiv.org/abs/2505.03203v1
- Date: Tue, 06 May 2025 05:38:13 GMT
- Title: PiCo: Enhancing Text-Image Alignment with Improved Noise Selection and Precise Mask Control in Diffusion Models
- Authors: Chang Xie, Chenyi Zhuang, Pan Gao
- Abstract summary: We propose PiCo (Pick-and-Control), a novel training-free approach with two key components to tackle these two factors. First, we develop a noise selection module to assess the quality of the random noise and determine whether the noise is suitable for the target text. Second, we introduce a referring mask module to generate pixel-level masks and to precisely modulate the cross-attention maps.
- Score: 10.767325147254574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advanced diffusion models have made notable progress in text-to-image compositional generation. However, it is still a challenge for existing models to achieve text-image alignment when confronted with complex text prompts. In this work, we highlight two factors that affect this alignment: the quality of the randomly initialized noise and the reliability of the generated controlling mask. We then propose PiCo (Pick-and-Control), a novel training-free approach with two key components to tackle these two factors. First, we develop a noise selection module to assess the quality of the random noise and determine whether the noise is suitable for the target text. A fast sampling strategy is utilized to ensure efficiency in the noise selection stage. Second, we introduce a referring mask module to generate pixel-level masks and to precisely modulate the cross-attention maps. The referring mask is applied to the standard diffusion process to guide the reasonable interaction between text and image features. Extensive experiments have been conducted to verify the effectiveness of PiCo in liberating users from the tedious process of random generation and in enhancing the text-image alignment for diverse text descriptions.
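The abstract describes a two-stage, training-free pipeline: pick a random noise whose cheap few-step preview already matches the prompt, then run the full denoising with mask-controlled cross-attention. The sketch below illustrates the first ("Pick") stage only, using Hugging Face diffusers with a CLIP similarity score as a stand-in quality metric; the checkpoint names, candidate count, step counts, and scoring rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a "Pick"-style noise selection stage (hypothetical; the
# CLIP-based score and all hyperparameters are assumptions, not the paper's).
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image, prompt):
    # Stand-in for the paper's noise-quality assessment: text-image similarity.
    inputs = proc(text=[prompt], images=image, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        return clip(**inputs).logits_per_image.item()

def pick_noise(prompt, n_candidates=8, fast_steps=5, base_seed=0):
    # Score each candidate noise via a cheap few-step preview; keep the best.
    shape = (1, pipe.unet.config.in_channels, 64, 64)  # SD 1.5 latents at 512x512
    best_noise, best_score = None, float("-inf")
    for i in range(n_candidates):
        gen = torch.Generator(device=device).manual_seed(base_seed + i)
        noise = torch.randn(shape, generator=gen, device=device)
        preview = pipe(prompt, latents=noise, num_inference_steps=fast_steps).images[0]
        score = clip_score(preview, prompt)
        if score > best_score:
            best_noise, best_score = noise, score
    return best_noise

prompt = "a red cube on top of a blue sphere"
noise = pick_noise(prompt)
# The "Control" stage would rerun full sampling from this noise with the
# referring mask modulating cross-attention; here we run standard sampling.
image = pipe(prompt, latents=noise, num_inference_steps=50).images[0]
image.save("picked.png")
```

The design rationale, as the abstract hints, is that a few-step preview is cheap enough to amortize over many candidate noises while still correlating with where the full sampling trajectory will land.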
Related papers
- Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation [54.52905471078152]
We propose a mask-free talking face generation approach while maintaining the 2D-based face editing task. We transform the input images to have closed mouths, using a two-step landmark-based approach trained in an unpaired manner.
arXiv Detail & Related papers (2025-07-28T16:03:36Z) - NOFT: Test-Time Noise Finetune via Information Bottleneck for Highly Correlated Asset Creation [70.96827354717459]
Diffusion models have provided a strong tool for implementing text-to-image (T2I) and image-to-image (I2I) generation. We propose a noise finetuning (NOFT) module, employed by Stable Diffusion, to generate highly correlated and diverse images.
arXiv Detail & Related papers (2025-05-18T05:09:47Z) - The Silent Assistant: NoiseQuery as Implicit Guidance for Goal-Driven Image Generation [31.599902235859687]
We propose leveraging aligned Gaussian noise as implicit guidance to complement explicit user-defined inputs such as text prompts. NoiseQuery enables fine-grained control and yields significant performance boosts for both high-level semantics and low-level visual attributes.
arXiv Detail & Related papers (2024-12-06T14:59:00Z) - DiffSTR: Controlled Diffusion Models for Scene Text Removal [5.790630195329777]
Scene Text Removal (STR) aims to prevent unauthorized use of text in images.
STR faces several challenges, including boundary artifacts, inconsistent texture and color, and preserving correct shadows.
We introduce a ControlNet diffusion model, treating STR as an inpainting task.
We develop a mask pretraining pipeline to condition our diffusion model.
arXiv Detail & Related papers (2024-10-29T04:20:21Z) - AMNS: Attention-Weighted Selective Mask and Noise Label Suppression for Text-to-Image Person Retrieval [3.591122855617648]
The noisy correspondence (NC) issue arises from poor image quality and labeling errors. Random masking augmentation may inadvertently discard critical semantic content. A Bidirectional Similarity Distribution Matching (BSDM) loss enables the model to learn effectively from positive pairs, while a Weight Adjustment Focal (WAF) loss improves its ability to handle hard samples.
arXiv Detail & Related papers (2024-09-10T10:08:01Z) - Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition [56.968108142307976]
We propose a novel approach called Class-Aware Mask-guided feature refinement (CAM).
Our approach introduces canonical class-aware glyph masks to suppress background and text style noise.
By enhancing the alignment between the canonical mask feature and the text feature, the module ensures more effective fusion.
arXiv Detail & Related papers (2024-02-21T09:22:45Z) - Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision [87.15580604023555]
Unpair-Seg is a novel weakly-supervised open-vocabulary segmentation framework.
It learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected.
It achieves 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets.
arXiv Detail & Related papers (2024-02-14T06:01:44Z) - MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers [30.924202893340087]
State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks.
This paper breaks down the text-based video editing task into two stages.
First, we leverage a pre-trained text-to-image diffusion model to simultaneously edit a few keyframes in a zero-shot way.
Second, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers.
arXiv Detail & Related papers (2023-12-19T07:05:39Z) - MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning.
We propose an adaptive mask, conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features (a generic sketch of such mask-modulated cross-attention appears after this list).
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z) - Text-Conditioned Sampling Framework for Text-to-Image Generation with Masked Generative Models [52.29800567587504]
We propose a learnable sampling model, Text-Conditioned Token Selection (TCTS), to select optimal tokens via localized supervision with text information.
TCTS improves not only the image quality but also the semantic alignment of the generated images with the given texts.
We validate the efficacy of TCTS combined with Frequency Adaptive Sampling (FAS) with various generative tasks, demonstrating that it significantly outperforms the baselines in image-text alignment and image quality.
arXiv Detail & Related papers (2023-04-04T03:52:49Z) - Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z) - Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in
Transformers [91.00397473678088]
Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions.
We propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality.
Our model can generate high-fidelity lip-synced results for arbitrary subjects.
arXiv Detail & Related papers (2022-12-09T16:32:46Z)
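Several entries above share one mechanism with PiCo: modulating cross-attention with a mask so that each text token only influences the image regions it should (MaskDiffusion's adaptive mask, PiCo's referring mask). Below is a generic PyTorch sketch of that idea; the quantile-based mask rule is an illustrative assumption and matches neither paper exactly.

```python
# Generic sketch of mask-modulated cross-attention; the quantile thresholding
# rule is an illustrative assumption, not either paper's exact scheme.
import torch

def masked_cross_attention(q, k, v, token_mask=None):
    # q: (B, Nq, D) image queries; k, v: (B, Nt, D) text keys/values.
    # token_mask: (B, Nq, Nt) in [0, 1], scaling each pixel-token interaction.
    attn = torch.softmax(q @ k.transpose(-2, -1) * q.shape[-1] ** -0.5, dim=-1)
    if token_mask is not None:
        attn = attn * token_mask                                      # suppress pairs
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)  # renormalize
    return attn @ v

def adaptive_mask(attn, keep_ratio=0.5):
    # For each text token, keep only the pixels whose attention to that token
    # lies above the token's (1 - keep_ratio) quantile over all pixels.
    thresh = attn.quantile(1.0 - keep_ratio, dim=1, keepdim=True)  # (B, 1, Nt)
    return (attn >= thresh).float()
```

In practice such a mask would be injected into a diffusion U-Net by replacing its cross-attention processors, so that text-image feature interactions stay confined to the masked regions during sampling.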