The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation
- URL: http://arxiv.org/abs/2407.12579v1
- Date: Wed, 17 Jul 2024 14:04:10 GMT
- Title: The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation
- Authors: Yi Yao, Chan-Feng Hsu, Jhe-Hao Lin, Hongxia Xie, Terence Lin, Yi-Ning Huang, Hong-Han Shuai, Wen-Huang Cheng
- Abstract summary: This work explores how diffusion models can generate images from prompts requiring artistic creativity or specialized knowledge.
We introduce the Realistic-Fantasy Benchmark (RFBench), a novel evaluation framework blending realistic and fantastical scenarios.
Extensive human evaluations and GPT-based compositional assessments demonstrate our approach's superiority over state-of-the-art methods.
- Score: 26.221866701670546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In spite of recent advancements in text-to-image generation, limitations persist in handling complex and imaginative prompts due to the restricted diversity and complexity of training data. This work explores how diffusion models can generate images from prompts requiring artistic creativity or specialized knowledge. We introduce the Realistic-Fantasy Benchmark (RFBench), a novel evaluation framework blending realistic and fantastical scenarios. To address these challenges, we propose the Realistic-Fantasy Network (RFNet), a training-free approach integrating diffusion models with LLMs. Extensive human evaluations and GPT-based compositional assessments demonstrate our approach's superiority over state-of-the-art methods. Our code and dataset are available at https://leo81005.github.io/Reality-and-Fantasy/.
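To make the integration concrete, below is a minimal sketch of the general LLM-then-diffusion pattern that training-free approaches like RFNet build on. The `expand_prompt` helper, its prompt template, and the model checkpoint are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the LLM-then-diffusion pattern behind
# training-free approaches such as RFNet. The expand_prompt helper,
# its template, and the checkpoint are assumptions for illustration,
# not the paper's implementation.
import torch
from diffusers import StableDiffusionPipeline

def expand_prompt(prompt: str) -> str:
    """Stand-in for an LLM call that rewrites an imaginative prompt
    into explicit visual details the diffusion model can follow."""
    # A real system would query an LLM with an instruction such as:
    # "Describe this scene in concrete visual terms: <prompt>"
    return prompt + ", rendered with concrete, highly detailed visuals"

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

fantastical = "a clockwork whale swimming through an aurora"
image = pipe(expand_prompt(fantastical)).images[0]
image.save("scene.png")
```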
Related papers
- Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention [11.686174382596667]
Cross-modal conceptual blending in humans is prone to cognitive biases such as design fixation.
We propose a T2I diffusion adapter "IT-Blender" that can automate the blending process to enhance human creativity.
arXiv Detail & Related papers (2025-06-30T17:41:25Z)
- RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning [88.14234949860105]
RePrompt is a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning.
Our approach enables end-to-end training without human-annotated data.
arXiv Detail & Related papers (2025-05-23T06:44:26Z)
- GODBench: A Benchmark for Multimodal Large Language Models in Video Comment Art [38.40471808648207]
Video Comment Art enhances user engagement by providing creative content that conveys humor, satire, or emotional resonance.
We introduce GODBench, a novel benchmark that integrates video and text modalities to systematically evaluate MLLMs' abilities to compose Comment Art.
We also propose Ripple of Thought (RoT), a multi-step reasoning framework designed to enhance the creativity of MLLMs.
arXiv Detail & Related papers (2025-05-16T16:56:40Z)
- Cross-Cultural Fashion Design via Interactive Large Language Models and Diffusion Models [0.0]
Fashion content generation is an emerging area at the intersection of artificial intelligence and creative design.
Existing methods struggle with cultural bias, limited scalability, and misalignment between textual prompts and generated visuals.
We propose a novel framework that integrates Large Language Models (LLMs) with Latent Diffusion Models (LDMs) to address these challenges.
arXiv Detail & Related papers (2025-01-26T15:57:16Z)
- TexAVi: Generating Stereoscopic VR Video Clips from Text Descriptions [0.562479170374811]
This paper proposes an approach that coalesces existing generative systems to produce stereoscopic virtual reality video from text.
Our work highlights the exciting possibilities of using natural language-driven graphics in fields like virtual reality simulations.
arXiv Detail & Related papers (2025-01-02T09:21:03Z)
- KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities [93.74881034001312]
We conduct a systematic study on the fidelity of entities in text-to-image generation models.
We focus on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals.
Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details.
arXiv Detail & Related papers (2024-10-15T17:50:37Z)
- ORACLE: Leveraging Mutual Information for Consistent Character Generation with LoRAs in Diffusion Models [3.7599363231894185]
We introduce a novel framework designed to produce consistent character representations from a single text prompt.
Our framework outperforms existing methods in generating characters with consistent visual identities.
arXiv Detail & Related papers (2024-06-04T23:39:08Z)
- Diff-Mosaic: Augmenting Realistic Representations in Infrared Small Target Detection via Diffusion Prior [63.64088590653005]
We propose Diff-Mosaic, a data augmentation method based on the diffusion model.
We introduce an enhancement network called Pixel-Prior, which generates highly coordinated and realistic Mosaic images.
In the second stage, we propose an image enhancement strategy named Diff-Prior, which utilizes diffusion priors to model images of real-world scenes.
arXiv Detail & Related papers (2024-06-02T06:23:05Z)
- Closing the Visual Sim-to-Real Gap with Object-Composable NeRFs [59.12526668734703]
We introduce Composable Object Volume NeRF (COV-NeRF), an object-composable NeRF model that is the centerpiece of a real-to-sim pipeline.
COV-NeRF extracts objects from real images and composes them into new scenes, generating photorealistic renderings and many types of 2D and 3D supervision.
arXiv Detail & Related papers (2024-03-07T00:00:02Z)
- RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models [42.20230095700904]
RealCompo is a new training-free and transfer-friendly text-to-image generation framework.
An intuitive, novel balancer is proposed to dynamically balance the strengths of the two combined models during the denoising process.
Our RealCompo can be seamlessly extended with a wide range of spatial-aware image diffusion models and stylized diffusion models.
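The balancer operates on the models' noise predictions during denoising. Below is a minimal sketch of the underlying idea, weighted fusion of two denoisers' outputs; the fixed weight here is an assumption, whereas RealCompo adapts the balance dynamically at each step.

```python
# Minimal sketch of weighted fusion of two denoisers' noise
# predictions, the core idea behind RealCompo's balancer. The fixed
# weight is an assumption; the paper balances the models dynamically.
import torch

def fused_noise(eps_t2i: torch.Tensor,
                eps_spatial: torch.Tensor,
                w: float = 0.5) -> torch.Tensor:
    """Blend noise predicted by a realism-oriented T2I model with that
    of a spatial-aware (e.g., layout-conditioned) model."""
    return w * eps_t2i + (1.0 - w) * eps_spatial

# At each denoising step, both models predict noise for the latent x_t;
# the scheduler then steps with the balanced prediction.
x_t = torch.randn(1, 4, 64, 64)        # current latent (stand-in)
eps_a = torch.randn_like(x_t)          # stand-in: T2I model output
eps_b = torch.randn_like(x_t)          # stand-in: spatial model output
eps = fused_noise(eps_a, eps_b, w=0.6)
```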
arXiv Detail & Related papers (2024-02-20T10:56:52Z)
- A Dataset and Benchmark for Copyright Infringement Unlearning from Text-to-Image Diffusion Models [52.49582606341111]
Copyright law confers upon creators the exclusive rights to reproduce, distribute, and monetize their creative works.
Recent progress in text-to-image generation has introduced formidable challenges to copyright enforcement.
We introduce a novel pipeline that harmonizes CLIP, ChatGPT, and diffusion models to curate a dataset.
arXiv Detail & Related papers (2024-01-04T11:14:01Z)
- IT3D: Improved Text-to-3D Generation with Explicit View Synthesis [71.68595192524843]
This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues.
Our approach uses image-to-image pipelines, empowered by LDMs, to generate posed, high-quality images.
For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data.
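As a rough illustration of that real/fake assignment (the tiny CNN and random tensors below are stand-ins, not the paper's architecture):

```python
# Rough illustration of the real/fake assignment: synthesized
# multi-view images count as real data, renderings of the optimized
# 3D model as fake. The tiny CNN and random tensors are stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

disc = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Flatten(), nn.LazyLinear(1),
)

synthesized = torch.rand(8, 3, 64, 64)  # posed images from the LDM pipeline
rendered = torch.rand(8, 3, 64, 64)     # renderings of the 3D model

# Standard GAN discriminator loss with the roles described above.
d_loss = (
    F.binary_cross_entropy_with_logits(disc(synthesized), torch.ones(8, 1))
    + F.binary_cross_entropy_with_logits(disc(rendered), torch.zeros(8, 1))
)
d_loss.backward()
```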
arXiv Detail & Related papers (2023-08-22T14:39:17Z)
- ImaginaryNet: Learning Object Detectors without Real Images and Annotations [66.30908705345973]
We propose a framework to synthesize images by combining a pretrained language model with a text-to-image model.
With the synthesized images and class labels, weakly supervised object detection can then be leveraged to accomplish Imaginary-Supervised Object Detection.
Experiments show that ImaginaryNet achieves about 70% of the ISOD performance of a weakly supervised counterpart with the same backbone trained on real data.
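A minimal sketch of this label-to-description-to-image pipeline follows; `describe`, `train_wsod`, and the checkpoint choice are hypothetical placeholders, not ImaginaryNet's actual components.

```python
# Sketch of the label -> description -> image pipeline described
# above. describe() and train_wsod() are hypothetical placeholders,
# and the checkpoint choice is an assumption, not the paper's code.
import torch
from diffusers import StableDiffusionPipeline

CLASSES = ["dog", "bicycle", "bird"]

def describe(label: str) -> str:
    """Stand-in for the pretrained language model that writes a scene
    description guaranteed to contain the target class."""
    return f"a photo of a {label} in a natural outdoor scene"

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

dataset = []
for label in CLASSES:
    image = pipe(describe(label)).images[0]
    dataset.append((image, label))  # image-level label only, no boxes

# The (image, label) pairs then drive weakly supervised detection:
# train_wsod(dataset)  # hypothetical training entry point
```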
arXiv Detail & Related papers (2022-10-13T10:25:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.