PoemTale Diffusion: Minimising Information Loss in Poem to Image Generation with Multi-Stage Prompt Refinement
- URL: http://arxiv.org/abs/2507.13708v2
- Date: Wed, 23 Jul 2025 13:00:06 GMT
- Title: PoemTale Diffusion: Minimising Information Loss in Poem to Image Generation with Multi-Stage Prompt Refinement
- Authors: Sofia Jamil, Bollampalli Areen Reddy, Raghvendra Kumar, Sriparna Saha, Koustava Goswami, K. J. Joseph
- Abstract summary: PoemTale Diffusion aims to minimise the information that is lost during poetic text-to-image conversion. To support this, we adapt existing state-of-the-art diffusion models by modifying their self-attention mechanisms. To encourage research in the field of poetry, we introduce the P4I dataset, consisting of 1111 poems.
- Score: 18.293592213622183
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in text-to-image diffusion models have achieved remarkable success in generating realistic and diverse visual content. A critical factor in this process is the model's ability to accurately interpret textual prompts. However, these models often struggle with creative expressions, particularly those involving complex, abstract, or highly descriptive language. In this work, we introduce a novel training-free approach tailored to improve image generation for a unique form of creative language: poetic verse, which frequently features layered, abstract, and dual meanings. Our proposed PoemTale Diffusion approach aims to minimise the information that is lost during poetic text-to-image conversion by integrating a multi-stage prompt refinement loop into Language Models to enhance the interpretability of poetic texts. To support this, we adapt existing state-of-the-art diffusion models by modifying their self-attention mechanisms with a consistent self-attention technique to generate multiple consistent images, which are then collectively used to convey the poem's meaning. Moreover, to encourage research in the field of poetry, we introduce the P4I (PoemForImage) dataset, consisting of 1111 poems sourced from multiple online and offline resources. We engaged a panel of poetry experts for qualitative assessments. The results from both human and quantitative evaluations validate the efficacy of our method and contribute a novel perspective to poem-to-image generation with enhanced information capture in the generated images.
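The abstract describes the refinement loop only at a high level. As an illustration of what a multi-stage critique-and-revise loop over a language model might look like, here is a minimal Python sketch; the function name, the staged prompts, and the `llm` callable are hypothetical, not the authors' implementation.

```python
# Minimal sketch of a multi-stage prompt refinement loop (hypothetical;
# the paper's actual prompts and stopping criterion are not specified).
# `llm` stands in for any text-generation callable, e.g. an API client.

def refine_poem_prompt(llm, poem: str, num_stages: int = 3) -> str:
    """Iteratively distill a poem into a concrete image prompt."""
    # Stage 1: surface the poem's literal scene and figurative meaning.
    interpretation = llm(
        f"Explain the literal scene and the figurative meaning of this poem:\n{poem}"
    )
    prompt = llm(
        "Write a concise, visually concrete image-generation prompt "
        f"for this interpretation:\n{interpretation}"
    )
    # Stages 2..N: critique the current prompt against the poem and revise,
    # so imagery dropped by an earlier pass can be recovered.
    for _ in range(num_stages - 1):
        critique = llm(
            f"Poem:\n{poem}\n\nPrompt:\n{prompt}\n\n"
            "List any imagery, mood, or symbols from the poem missing in the prompt."
        )
        prompt = llm(
            f"Revise this image prompt to address the critique.\n"
            f"Prompt:\n{prompt}\nCritique:\n{critique}"
        )
    return prompt
```

Each extra stage trades one model call for a chance to recover lost imagery, which is the sense in which such a loop minimises information loss.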
Related papers
- RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning [88.14234949860105]
RePrompt is a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Our approach enables end-to-end training without human-annotated data.
arXiv Detail & Related papers (2025-05-23T06:44:26Z)
- Poetry in Pixels: Prompt Tuning for Poem Image Generation via Diffusion Models [18.293592213622183]
We propose a PoemToPixel framework designed to generate images that visually represent the inherent meanings of poems. Our approach incorporates the concept of prompt tuning in our image generation framework to ensure that the resulting images closely align with the poetic content. To expand the diversity of the poetry dataset, we introduce MiniPo, a novel multimodal dataset comprising 1001 children's poems and images.
arXiv Detail & Related papers (2025-01-10T10:26:54Z)
- Semi-supervised Chinese Poem-to-Painting Generation via Cycle-consistent Adversarial Networks [2.250406890348191]
We propose a semi-supervised approach using cycle-consistent adversarial networks to leverage the limited paired data.
We introduce novel evaluation metrics to assess the quality, diversity, and consistency of the generated poems and paintings.
The proposed model outperforms previous methods, showing promise in capturing the symbolic essence of artistic expression.
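The summary gives no implementation detail; as a rough illustration of the cycle-consistency idea on unpaired poem/painting data, here is a minimal PyTorch sketch. `G_p2i` and `G_i2p` are placeholder generator networks, and the L1 reconstruction losses are a common choice rather than the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(poem_emb, painting, G_p2i, G_i2p):
    """Cycle loss for unpaired data: poem -> painting -> poem and back.

    G_p2i maps poem embeddings to images and G_i2p maps images back to
    poem embeddings; both are arbitrary nn.Modules in this sketch.
    """
    # Forward cycle: a poem translated to a painting should be recoverable.
    fake_painting = G_p2i(poem_emb)
    recovered_poem = G_i2p(fake_painting)
    loss_fwd = F.l1_loss(recovered_poem, poem_emb)

    # Backward cycle: a painting translated to a poem should be recoverable.
    fake_poem = G_i2p(painting)
    recovered_painting = G_p2i(fake_poem)
    loss_bwd = F.l1_loss(recovered_painting, painting)

    return loss_fwd + loss_bwd
```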
arXiv Detail & Related papers (2024-10-25T04:57:44Z)
- ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models [52.23899502520261]
We introduce a novel framework named ARTIST, which incorporates a dedicated textual diffusion model to focus on the learning of text structures specifically. We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model. This disentangled architecture design and training strategy significantly enhance the text rendering ability of the diffusion models for text-rich image generation.
arXiv Detail & Related papers (2024-06-17T19:31:24Z)
- Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Text-to-image models struggle to portray the same subject consistently across diverse prompts.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
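As a rough illustration of activation sharing (a simplified reading that omits ConsiStory's subject masking and feature injection), self-attention can be extended so every image in a batch attends to the others' keys and values:

```python
import torch

def shared_self_attention(q, k, v):
    """Sketch of batch-shared self-attention for subject consistency.

    q, k, v: (batch, tokens, dim). Instead of each image attending only
    to its own tokens, keys and values from every image in the batch
    are concatenated, so images can reuse each other's subject features.
    The real method restricts this sharing to subject regions; that
    masking is omitted here.
    """
    b, t, d = k.shape
    # Expose all images' keys/values to every query: (b, b*t, dim).
    k_shared = k.reshape(1, b * t, d).expand(b, -1, -1)
    v_shared = v.reshape(1, b * t, d).expand(b, -1, -1)
    attn = torch.softmax(q @ k_shared.transpose(1, 2) / d**0.5, dim=-1)
    return attn @ v_shared
```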
arXiv Detail & Related papers (2024-02-05T18:42:34Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach makes text-to-image diffusion models easier to use, improving the user experience.
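In the spirit of the adapter idea (a generic bottleneck residual adapter, not the authors' exact architecture), a parameter-efficient module sitting between a frozen text encoder and a frozen diffusion model might look like:

```python
import torch
import torch.nn as nn

class SemanticAdapter(nn.Module):
    """Toy bottleneck adapter (illustrative, not the SUR-adapter design):
    a small trainable module that refines the frozen text encoder's
    embeddings via a residual connection, leaving both the text encoder
    and the diffusion model untouched."""

    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # Residual update keeps the original conditioning intact while
        # letting the adapter inject the distilled semantics it learns.
        return text_emb + self.up(self.act(self.down(text_emb)))
```

Only the small down/up projections are trained, which is what keeps such fine-tuning parameter-efficient.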
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
- Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models [103.61066310897928]
Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt.
While revolutionary, current state-of-the-art diffusion models may still fail in generating images that fully convey the semantics in the given text prompt.
We analyze the publicly available Stable Diffusion model and assess the existence of catastrophic neglect, where the model fails to generate one or more of the subjects from the input prompt.
We introduce the concept of Generative Semantic Nursing (GSN), where we seek to intervene in the generative process on the fly during inference time to improve the faithfulness of the generated images.
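A heavily simplified sketch of one GSN-style update follows, assuming the cross-attention maps were recorded with gradients enabled so they are differentiable with respect to the latents; the exact loss and step schedule here are illustrative, not the paper's.

```python
import torch

def gsn_step(latents, cross_attn_maps, subject_token_ids, step_size=0.1):
    """One inference-time latent update against a neglect loss.

    cross_attn_maps: (tokens, H, W) cross-attention over prompt tokens,
    assumed differentiable w.r.t. `latents`. The loss targets the
    most-neglected subject token; stepping the latent against it boosts
    that token's attention before the next denoising step.
    """
    # Max attention each subject token receives anywhere in the image.
    per_token_max = torch.stack(
        [cross_attn_maps[i].max() for i in subject_token_ids]
    )
    loss = (1.0 - per_token_max).max()  # focus on the weakest subject
    grad = torch.autograd.grad(loss, latents)[0]
    return latents - step_size * grad
```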
arXiv Detail & Related papers (2023-01-31T18:10:38Z)
- Language Does More Than Describe: On The Lack Of Figurative Speech in Text-To-Image Models [63.545146807810305]
Text-to-image diffusion models can generate high-quality pictures from textual input prompts.
These models have been trained using text data collected from content-based labelling protocols.
We characterise the sentimentality, objectiveness and degree of abstraction of publicly available text data used to train current text-to-image diffusion models.
arXiv Detail & Related papers (2022-10-19T14:20:05Z)
- Generating Chinese Poetry from Images via Concrete and Abstract Information [23.690384629376005]
We propose an infilling-based Chinese poetry generation model that explicitly infills concrete keywords into each line of a poem.
We also use non-parallel data during training and construct separate image datasets and poem datasets to train the different components in our framework.
Both automatic and human evaluation results show that our approach can generate poems with better consistency with the images without losing quality.
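As an illustration of explicit keyword infilling (a hypothetical interface; the paper trains a dedicated infilling model whose details are not in the summary), a generation loop might fix one concrete keyword per line and let a fill-in-the-blank model complete the rest:

```python
# Sketch of keyword infilling for image-to-poem generation. `infill`
# stands in for any fill-in-the-blank text model and is hypothetical.

def generate_poem(infill, concrete_keywords, num_lines=4):
    """Build a poem by explicitly infilling one image keyword per line."""
    lines = []
    for i in range(num_lines):
        keyword = concrete_keywords[i % len(concrete_keywords)]
        # Each line is a template with the concrete keyword fixed in
        # place; the model fills in the surrounding blanks, conditioned
        # on the lines written so far.
        template = f"[BLANK] {keyword} [BLANK]"
        lines.append(infill(template, context="\n".join(lines)))
    return "\n".join(lines)
```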
arXiv Detail & Related papers (2020-03-24T11:17:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.