Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models
- URL: http://arxiv.org/abs/2505.08622v2
- Date: Mon, 21 Jul 2025 05:47:57 GMT
- Title: Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models
- Authors: Donghoon Kim, Minji Bae, Kyuhong Shim, Byonghyo Shim
- Abstract summary: Visually Guided Decoding (VGD) is a gradient-free approach that leverages large language models and CLIP-based guidance to generate coherent and semantically aligned prompts. Our experiments demonstrate that VGD outperforms existing prompt inversion techniques in generating understandable and contextually relevant prompts.
- Score: 24.203389418820123
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image generative models like DALL-E and Stable Diffusion have revolutionized visual content creation across various applications, including advertising, personalized media, and design prototyping. However, crafting effective textual prompts to guide these models remains challenging, often requiring extensive trial and error. Existing prompt inversion approaches, such as soft and hard prompt techniques, are not so effective due to the limited interpretability and incoherent prompt generation. To address these issues, we propose Visually Guided Decoding (VGD), a gradient-free approach that leverages large language models (LLMs) and CLIP-based guidance to generate coherent and semantically aligned prompts. In essence, VGD utilizes the robust text generation capabilities of LLMs to produce human-readable prompts. Further, by employing CLIP scores to ensure alignment with user-specified visual concepts, VGD enhances the interpretability, generalization, and flexibility of prompt generation without the need for additional training. Our experiments demonstrate that VGD outperforms existing prompt inversion techniques in generating understandable and contextually relevant prompts, facilitating more intuitive and controllable interactions with text-to-image models.
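To make the abstract's decoding idea concrete, below is a minimal sketch of how such CLIP-guided, gradient-free prompt inversion could be wired together with off-the-shelf Hugging Face models. The checkpoints (GPT-2, CLIP ViT-B/32), the greedy top-k candidate search, and the mixing weight `alpha` are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: at each step the LLM proposes fluent next tokens and CLIP scores the
# extended prompt against the target image; only forward passes, no gradients.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

lm_tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in for a larger LLM
lm = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def invert_prompt(image, prefix="A photo of", max_new_tokens=16, top_k=20, alpha=0.7):
    """Greedy CLIP-guided decoding: rank the LM's top-k candidates by a mix of
    LM probability and CLIP image-text similarity of the extended prompt."""
    # Embed the target image once; all guidance is score-based.
    img_in = clip_proc(images=image, return_tensors="pt").to(device)
    img_emb = clip.get_image_features(**img_in)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    ids = lm_tok(prefix, return_tensors="pt").input_ids.to(device)
    for _ in range(max_new_tokens):
        # LLM proposes fluent candidate next tokens.
        next_logits = lm(ids).logits[0, -1]
        probs, cand = next_logits.softmax(-1).topk(top_k)

        # Score each candidate continuation against the image with CLIP.
        texts = [lm_tok.decode(torch.cat([ids[0], tok.view(1)])) for tok in cand]
        txt_in = clip_proc(text=texts, return_tensors="pt",
                           padding=True, truncation=True).to(device)
        txt_emb = clip.get_text_features(**txt_in)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        clip_scores = (txt_emb @ img_emb.T).squeeze(-1)

        # Pick the token that balances fluency and image alignment.
        best = (alpha * clip_scores + (1 - alpha) * probs.log()).argmax()
        ids = torch.cat([ids, cand[best].view(1, 1)], dim=1)

    return lm_tok.decode(ids[0], skip_special_tokens=True)

# Example usage:
# prompt = invert_prompt(Image.open("target.png"))
```

In this sketch the language model keeps the prompt human-readable while CLIP steers it toward the user-specified visual concept; since no gradients flow through either model, no additional training is required.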
Related papers
- Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization [75.88719716002014]
Domain Generalization (DG) seeks to develop a versatile model capable of performing effectively on unseen target domains. Recent advances in pre-trained Visual Foundation Models (VFMs) have demonstrated considerable potential in enhancing the generalization capabilities of deep learning models. We propose addressing this challenge by leveraging the controllable and flexible language prompt of the VFM.
arXiv Detail & Related papers (2025-07-03T03:52:37Z)
- Multimodal Prompt Alignment for Facial Expression Recognition [24.470095812039286]
MPA-FER provides fine-grained semantic guidance to the learning process of prompted visual features. Our framework outperforms state-of-the-art methods on three FER benchmark datasets.
arXiv Detail & Related papers (2025-06-26T05:28:57Z)
- EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models [31.31018600797305]
We propose a prompt inversion technique called EDITOR for text-to-image diffusion models. Our method outperforms existing methods in terms of image similarity, textual alignment, prompt interpretability, and generalizability.
arXiv Detail & Related papers (2025-06-03T16:44:15Z)
- RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning [88.14234949860105]
RePrompt is a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Our approach enables end-to-end training without human-annotated data.
arXiv Detail & Related papers (2025-05-23T06:44:26Z)
- IPO: Interpretable Prompt Optimization for Vision-Language Models [40.83071220530289]
This paper introduces a simple but interpretable prompt optimizer (IPO).
IPO utilizes large language models (LLMs) to generate textual prompts dynamically.
We incorporate a large multimodal model (LMM) to condition on visual content by generating image descriptions.
arXiv Detail & Related papers (2024-10-20T14:10:22Z)
- Coherent Zero-Shot Visual Instruction Generation [15.0521272616551]
This paper introduces a simple, training-free framework to tackle the issues of generating visual instructions.
Our approach systematically integrates text comprehension and image generation to ensure visual instructions are visually appealing.
Our experiments show that our approach can visualize coherent and visually pleasing instructions.
arXiv Detail & Related papers (2024-06-06T17:59:44Z)
- Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control [73.6361029556484]
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs.
We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts.
We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
arXiv Detail & Related papers (2024-05-09T15:39:54Z)
- Dynamic Prompt Optimizing for Text-to-Image Generation [63.775458908172176]
We introduce the Prompt Auto-Editing (PAE) method to improve text-to-image generative models.
We employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, leading to the dynamic fine-control prompts.
arXiv Detail & Related papers (2024-04-05T13:44:39Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach can make text-to-image diffusion models easier to use with better user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
- Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models [103.61066310897928]
Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt.
While revolutionary, current state-of-the-art diffusion models may still fail in generating images that fully convey the semantics in the given text prompt.
We analyze the publicly available Stable Diffusion model and assess the existence of catastrophic neglect, where the model fails to generate one or more of the subjects from the input prompt.
We introduce the concept of Generative Semantic Nursing (GSN), where we seek to intervene in the generative process on the fly during inference time to improve the faithfulness of the generated images.
arXiv Detail & Related papers (2023-01-31T18:10:38Z)