Replace in Translation: Boost Concept Alignment in Counterfactual Text-to-Image
- URL: http://arxiv.org/abs/2505.14341v1
- Date: Tue, 20 May 2025 13:27:52 GMT
- Title: Replace in Translation: Boost Concept Alignment in Counterfactual Text-to-Image
- Authors: Sifan Li, Ming Tao, Hao Zhao, Ling Shao, Hao Tang
- Abstract summary: We propose a strategy, called Explicit Logical Narrative Prompt (ELNP), to instruct this replacement process. We design a metric to calculate how many of the required concepts in the prompt are covered, on average, in the synthesized images. Extensive experiments and qualitative comparisons demonstrate that our strategy can boost concept alignment in counterfactual T2I.
- Score: 53.09546752700792
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-Image (T2I) has become prevalent in recent years, and the most common conditional tasks have been optimized well. However, counterfactual Text-to-Image still stands in the way of a more versatile AIGC experience. For scenes that cannot occur in the real world or that defy physics, we should spare no effort to improve both the factual feel, i.e., synthesizing images that people find plausible, and concept alignment, i.e., ensuring that all the required objects appear in the same frame. In this paper, we focus on concept alignment. Since controllable T2I models have achieved satisfactory performance in real applications, we use this technology to replace the objects in a synthesized image in latent space step by step, turning the image from a common scene into a counterfactual scene that meets the prompt. We propose a strategy, called Explicit Logical Narrative Prompt (ELNP), to instruct this replacement process, using the state-of-the-art language model DeepSeek to generate the instructions. Furthermore, to evaluate models' performance in counterfactual T2I, we design a metric that calculates how many of the required concepts in the prompt are covered, on average, in the synthesized images. Extensive experiments and qualitative comparisons demonstrate that our strategy boosts concept alignment in counterfactual T2I.
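The abstract does not give the exact formula for the coverage metric, but a natural reading is "average fraction of required concepts that a detector finds in each synthesized image." The sketch below illustrates that interpretation; the function name `concept_coverage` and the pluggable `detect` callable are assumptions for illustration, not the paper's implementation.

```python
from typing import Callable, Sequence

def concept_coverage(
    images: Sequence[object],
    required_concepts: Sequence[str],
    detect: Callable[[object, str], bool],
) -> float:
    """Average fraction of required concepts detected per synthesized image.

    `detect(image, concept)` is a stand-in for any concept detector
    (e.g. an open-vocabulary detector or a VQA model, not specified by
    the paper); it returns True if the concept appears in the image.
    """
    if not images or not required_concepts:
        return 0.0
    per_image = []
    for img in images:
        # Count how many of the prompt's required concepts are present.
        hits = sum(1 for c in required_concepts if detect(img, c))
        per_image.append(hits / len(required_concepts))
    # Mean coverage over all images generated for the prompt.
    return sum(per_image) / len(per_image)

if __name__ == "__main__":
    # Dummy usage: a detector that "finds" every concept gives coverage 1.0.
    dummy_images = ["img_0", "img_1"]
    concepts = ["astronaut", "horse", "moon"]
    print(concept_coverage(dummy_images, concepts, lambda img, c: True))
```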
Related papers
- Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning [69.33115351856785]
We present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that T2I-PAL can boost recognition performance by 3.47% on average.
arXiv Detail & Related papers (2025-06-12T11:09:49Z) - RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning [88.14234949860105]
RePrompt is a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Our approach enables end-to-end training without human-annotated data.
arXiv Detail & Related papers (2025-05-23T06:44:26Z) - TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning in Text-to-Image Models [19.1659725630146]
Training-Free Text-and-Image-to-Image (TF-TI2I) adapts cutting-edge T2I models without the need for additional training. Our approach shows robust performance across various benchmarks, confirming its effectiveness in handling complex image-generation tasks.
arXiv Detail & Related papers (2025-03-19T15:03:19Z) - End-to-end Training for Text-to-Image Synthesis using Dual-Text Embeddings [5.217870815854702]
We study an approach to learning text embeddings specifically tailored to the Text-to-Image synthesis network. We combine generative and contrastive training and use two embeddings, one optimized to enhance the photo-realism of the generated images, and the other seeking to capture text-to-image alignment. A comprehensive set of experiments on three text-to-image benchmark datasets reveals that having two separate embeddings gives better results than using a shared one, and that such an approach performs favourably in comparison with methods that use text representations from a pre-trained text encoder trained with a discriminative approach.
arXiv Detail & Related papers (2025-02-03T16:40:47Z) - TIPS: Text-Image Pretraining with Spatial awareness [13.38247732379754]
Self-supervised image-only pretraining is still the go-to method for many vision applications. We propose a novel general-purpose image-text model, which can be effectively used off the shelf for dense and global vision tasks.
arXiv Detail & Related papers (2024-10-21T21:05:04Z) - Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? [97.0899853256201]
We present a novel task and benchmark for evaluating the ability of text-to-image generation models to produce images that align with commonsense in real life.
We evaluate whether T2I models can conduct visual-commonsense reasoning, e.g. produce images that fit "the lightbulb is unlit" vs. "the lightbulb is lit".
We benchmark a variety of state-of-the-art (SOTA) T2I models and surprisingly find that there is still a large gap between image synthesis and real-life photos.
arXiv Detail & Related papers (2024-06-11T17:59:48Z) - Create Your World: Lifelong Text-to-Image Diffusion [75.14353789007902]
We propose Lifelong text-to-image Diffusion Model (L2DM) to overcome knowledge "catastrophic forgetting" for the past encountered concepts.
With respect to knowledge "catastrophic forgetting", our L2DM framework devises a task-aware memory enhancement module and an elastic-concept distillation module.
Our model can generate more faithful images across a range of continual text prompts in terms of both qualitative and quantitative metrics.
arXiv Detail & Related papers (2023-09-08T16:45:56Z) - Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion
Models [94.25020178662392]
Text-to-image (T2I) research has grown explosively in the past year.
One pain point persists: text prompt engineering. Searching for high-quality text prompts that yield customized results is more art than science.
In this paper, we take "Text" out of a pre-trained T2I diffusion model, to reduce the burdensome prompt engineering efforts for users.
arXiv Detail & Related papers (2023-05-25T16:30:07Z) - LeftRefill: Filling Right Canvas based on Left Reference through
Generalized Text-to-Image Diffusion Model [55.20469538848806]
This paper introduces LeftRefill, an innovative approach to efficiently harness large Text-to-Image (T2I) diffusion models for reference-guided image synthesis.
arXiv Detail & Related papers (2023-05-19T10:29:42Z) - Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text-to-image generation (T2I) model aims to generate photo-realistic images that are semantically consistent with the text descriptions.
We propose a novel framework, Semantic-Spatial Aware GAN, which is trained end-to-end so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z) - Scene Text Synthesis for Efficient and Effective Deep Network Training [62.631176120557136]
We develop an innovative image synthesis technique that composes annotated training images by embedding foreground objects of interest into background images.
The proposed technique consists of two key components that in principle boost the usefulness of the synthesized images in deep network training.
Experiments over a number of public datasets demonstrate the effectiveness of our proposed image synthesis technique.
arXiv Detail & Related papers (2019-01-26T10:15:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.