Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation
- URL: http://arxiv.org/abs/2403.19103v3
- Date: Mon, 28 Apr 2025 03:04:46 GMT
- Title: Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation
- Authors: Yutong He, Alexander Robey, Naoki Murata, Yiding Jiang, Joshua Nathaniel Williams, George J. Pappas, Hamed Hassani, Yuki Mitsufuji, Ruslan Salakhutdinov, J. Zico Kolter
- Abstract summary: PRISM is an algorithm that automatically produces human-interpretable and transferable prompts. Inspired by large language model (LLM) jailbreaking, PRISM leverages the in-context learning ability of LLMs to iteratively refine the candidate prompt distribution. Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles, and images across multiple T2I models.
- Score: 149.96612254604986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prompt engineering is an effective but labor-intensive way to control text-to-image (T2I) generative models. Its time-intensive nature and complexity have spurred the development of algorithms for automated prompt generation. However, these methods often struggle with transferability across T2I models, require white-box access to the underlying model, or produce non-intuitive prompts. In this work, we introduce PRISM, an algorithm that automatically produces human-interpretable and transferable prompts that can effectively generate desired concepts given only black-box access to T2I models. Inspired by large language model (LLM) jailbreaking, PRISM leverages the in-context learning ability of LLMs to iteratively refine the candidate prompt distribution built upon the reference images. Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles, and images across multiple T2I models, including Stable Diffusion, DALL-E, and Midjourney.
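The abstract describes PRISM only at a high level: an LLM proposes candidate prompts, a black-box T2I model renders them, a scorer compares the renders to the reference images, and the best-scoring prompts are fed back in-context for the next round. The snippet below is a minimal sketch of that loop, not the authors' implementation; `propose_prompts`, `generate_image`, and `similarity` are hypothetical callables standing in for an LLM API, a T2I endpoint (e.g., Stable Diffusion or DALL-E behind a service), and an image-similarity metric such as a CLIP score.

```python
from typing import Callable, List, Tuple

def refine_prompt(
    reference_images: List[object],
    propose_prompts: Callable[[List[Tuple[str, float]], int], List[str]],  # LLM: scored history -> new candidates
    generate_image: Callable[[str], object],                               # black-box T2I model
    similarity: Callable[[object, object], float],                         # e.g., CLIP image-image score
    n_iterations: int = 5,
    n_candidates: int = 4,
) -> str:
    """Iteratively refine a prompt for a target concept using only black-box access.

    Each round, the LLM sees the best (prompt, score) pairs so far as in-context
    examples and proposes new candidates; candidates are scored by how closely
    their generations match the reference images.
    """
    history: List[Tuple[str, float]] = []
    best_prompt, best_score = "", float("-inf")

    for _ in range(n_iterations):
        # Ask the LLM for new candidates, conditioned on the best-scored prompts so far.
        top_history = sorted(history, key=lambda x: -x[1])[:n_candidates]
        candidates = propose_prompts(top_history, n_candidates)
        for prompt in candidates:
            image = generate_image(prompt)  # black-box call: no gradients, no model internals
            # Average similarity of the generated image to every reference image.
            score = sum(similarity(image, ref) for ref in reference_images) / len(reference_images)
            history.append((prompt, score))
            if score > best_score:
                best_prompt, best_score = prompt, score

    return best_prompt
```

In PRISM proper, the refinement is framed as updating a distribution over candidate prompts and the judge can itself be a vision-language model; the skeleton above only captures the propose-generate-score-feedback structure that the abstract describes.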
Related papers
- One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt [101.17660804110409]
Text-to-image generation models can create high-quality images from input prompts.
However, they struggle to preserve subject identity consistently across the multiple images required for storytelling.
We propose a novel training-free method for consistent text-to-image generation.
arXiv Detail & Related papers (2025-01-23T10:57:22Z)
- RT-Attack: Jailbreaking Text-to-Image Models via Random Token [24.61198605177661]
We introduce a two-stage query-based black-box attack method utilizing random search.
In the first stage, we establish a preliminary prompt by maximizing the semantic similarity between the adversarial and target harmful prompts.
In the second stage, we use this initial prompt to refine our approach, creating a detailed adversarial prompt aimed at jailbreaking.
arXiv Detail & Related papers (2024-08-25T17:33:40Z)
- Not All Prompts Are Made Equal: Prompt-based Pruning of Text-to-Image Diffusion Models [59.16287352266203]
We introduce Adaptive Prompt-Tailored Pruning (APTP), a novel prompt-based pruning method for text-to-image (T2I) diffusion models.
Given a total desired compute budget, APTP learns to determine the capacity required for an input text prompt and routes it to a corresponding architecture code.
APTP outperforms the single-model pruning baselines in terms of FID, CLIP, and CMMD scores.
arXiv Detail & Related papers (2024-06-17T19:22:04Z)
- Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models [71.49054220807983]
A prevalent limitation persists in effectively communicating with T2I models, such as Stable Diffusion, using natural language descriptions.
Inspired by the recently released DALLE3, we revisit existing T2I systems that endeavor to align with human intent and introduce a new task: interactive text-to-image (iT2I).
We present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models.
arXiv Detail & Related papers (2023-10-11T16:53:40Z)
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing [73.74570290836152]
BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control.
Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation.
arXiv Detail & Related papers (2023-05-24T04:51:04Z)
- DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation [37.25815760042241]
This paper introduces a new framework, dubbed DirecT2V, for zero-shot text-to-video (T2V) generation.
We equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training.
The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos.
arXiv Detail & Related papers (2023-05-23T17:57:09Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection [53.320946030761796]
Diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt.
We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts.
We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system, as sketched after this list.
arXiv Detail & Related papers (2023-05-22T17:59:41Z)
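The generate-then-select scheme from the last entry can be illustrated in a few lines. This is a hypothetical outline under stated assumptions, not that paper's code; `generate_image` and `faithfulness_score` stand in for a T2I model call and an automatic text-image alignment scorer (e.g., a CLIP text-image score).

```python
from typing import Callable, List

def generate_by_selection(
    prompt: str,
    generate_image: Callable[[str, int], object],        # T2I model; second argument is a random seed
    faithfulness_score: Callable[[object, str], float],  # automatic image-text alignment score
    n_candidates: int = 8,
) -> object:
    """Sample several candidate images for one prompt and return the most faithful one."""
    candidates = [generate_image(prompt, seed) for seed in range(n_candidates)]
    scores = [faithfulness_score(image, prompt) for image in candidates]
    # Pick the candidate whose automatic score best matches the prompt.
    return candidates[max(range(n_candidates), key=lambda i: scores[i])]
```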