Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG
- URL: http://arxiv.org/abs/2412.09614v1
- Date: Thu, 12 Dec 2024 18:59:41 GMT
- Title: Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG
- Authors: Kavana Venkatesh, Yusuf Dalva, Ismini Lourentzou, Pinar Yanardag
- Abstract summary: We introduce a novel approach to enhance the capabilities of text-to-image models by incorporating graph-based retrieval-augmented generation (RAG).
Our system dynamically retrieves detailed character information and relational data from a knowledge graph, enabling the generation of visually accurate and contextually rich images.
- Score: 6.701537544179892
- License:
- Abstract: We introduce a novel approach to enhance the capabilities of text-to-image models by incorporating graph-based retrieval-augmented generation (RAG). Our system dynamically retrieves detailed character information and relational data from a knowledge graph, enabling the generation of visually accurate and contextually rich images. This capability addresses key limitations of existing T2I models, which often struggle to accurately depict complex or culturally specific subjects due to dataset constraints. Furthermore, we propose a novel self-correcting mechanism for text-to-image models to ensure consistency and fidelity in visual outputs, leveraging the rich context from the graph to guide corrections. Our qualitative and quantitative experiments demonstrate that Context Canvas significantly enhances the capabilities of popular models such as Flux, Stable Diffusion, and DALL-E, and improves the functionality of ControlNet for fine-grained image editing tasks. To our knowledge, Context Canvas represents the first application of graph-based RAG to enhancing T2I models, marking a significant advance toward producing high-fidelity, context-aware, multi-faceted images.
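The abstract describes a retrieve-and-enrich loop: look up an entity's attributes and relations in a knowledge graph and fold them into the prompt before generation. The minimal sketch below illustrates that general idea only; the graph schema, the `enrich_prompt` helper, and the culturally specific example entity are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of graph-based RAG for text-to-image prompting (illustrative,
# not the paper's code): retrieve an entity's attributes and one-hop relations
# from a knowledge graph and append them to the prompt before generation.
import networkx as nx
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical toy knowledge graph for a culturally specific subject.
kg = nx.DiGraph()
kg.add_node("Hanuman", attributes="monkey-faced deity, golden mace, saffron garments")
kg.add_edge("Hanuman", "Rama", relation="devotee of")

def enrich_prompt(prompt: str, entity: str, graph: nx.DiGraph) -> str:
    """Append retrieved attributes and one-hop relations for `entity` to the prompt."""
    parts = [prompt]
    if entity in graph:
        parts.append(graph.nodes[entity].get("attributes", ""))
        for _, neighbor, data in graph.out_edges(entity, data=True):
            parts.append(f"{data['relation']} {neighbor}")
    return ", ".join(p for p in parts if p)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = pipe(enrich_prompt("A portrait of Hanuman", "Hanuman", kg)).images[0]
```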
Related papers
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present the ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
- KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities [93.74881034001312]
We conduct a systematic study on the fidelity of entities in text-to-image generation models.
We focus on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals.
Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details.
arXiv Detail & Related papers (2024-10-15T17:50:37Z)
- FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation [19.65838242227773]
This paper contributes a novel, concise, and efficient approach that adapts a pre-trained large-scale text-to-image (T2I) diffusion model to the image-to-image (I2I) paradigm in a plug-and-play manner.
Our method allows flexible control over both the guiding factor and the guiding intensity of the reference image simply by tuning the type and bandwidth of the substituted frequency band.
arXiv Detail & Related papers (2024-08-02T04:13:38Z)
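As a rough illustration of the frequency-band substitution idea in the FBSDiff entry above, the sketch below swaps a centered low-frequency band of a generated feature map with the corresponding band from a reference feature map. The `bandwidth` parameter and the low-pass choice are assumptions for illustration, not the paper's exact method.

```python
# Illustrative frequency-band substitution on diffusion features (a sketch,
# not FBSDiff's implementation): replace the low-frequency band of the
# generated features with that of the reference features in Fourier space.
import torch

def substitute_low_freq(gen_feat: torch.Tensor, ref_feat: torch.Tensor,
                        bandwidth: float = 0.1) -> torch.Tensor:
    """gen_feat, ref_feat: (C, H, W) feature/latent maps from the same diffusion step."""
    gen_fft = torch.fft.fftshift(torch.fft.fft2(gen_feat), dim=(-2, -1))
    ref_fft = torch.fft.fftshift(torch.fft.fft2(ref_feat), dim=(-2, -1))
    _, h, w = gen_feat.shape
    cy, cx = h // 2, w // 2
    ry, rx = int(h * bandwidth), int(w * bandwidth)
    # Replace the centered (low-frequency) band; a larger bandwidth means
    # stronger guidance from the reference.
    gen_fft[..., cy - ry:cy + ry + 1, cx - rx:cx + rx + 1] = \
        ref_fft[..., cy - ry:cy + ry + 1, cx - rx:cx + rx + 1]
    out = torch.fft.ifft2(torch.fft.ifftshift(gen_fft, dim=(-2, -1)))
    return out.real
```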
- Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models [20.19571676239579]
We introduce a novel diffusion-based framework to enhance the alignment of generated images with their corresponding descriptions.
Our framework is built upon a comprehensive analysis of inconsistency phenomena, categorizing them based on their manifestation in the image.
We then integrate a state-of-the-art controllable image generation model with a visual text generation module to generate an image that is consistent with the original prompt.
arXiv Detail & Related papers (2024-06-24T06:12:16Z)
- ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models [52.23899502520261]
We introduce a novel framework named ARTIST, which incorporates a dedicated textual diffusion model to focus specifically on learning text structures.
We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model.
This disentangled architecture design and training strategy significantly enhance the text rendering ability of the diffusion models for text-rich image generation.
arXiv Detail & Related papers (2024-06-17T19:31:24Z)
- SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance [46.77060502803466]
We introduce the Scene Graph Adapter (SG-Adapter), which leverages the structured representation of scene graphs to rectify inaccuracies in the original text embeddings.
The SG-Adapter's explicit, non-fully-connected graph representation complements and improves upon the fully connected, transformer-based text representations.
arXiv Detail & Related papers (2024-05-24T08:00:46Z)
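One way to picture the SG-Adapter's non-fully-connected representation described above is as a sparse attention mask derived from scene-graph triples, so that text tokens attend only along graph edges rather than to every other token. The triple-to-token mapping and the simple masked attention below are illustrative assumptions, not the paper's architecture.

```python
# Illustrative sketch: build a sparse attention mask from scene-graph triples
# and apply masked self-attention over text embeddings (not SG-Adapter's code).
import torch

def scene_graph_mask(num_tokens: int, triples: list[tuple[int, int, int]]) -> torch.Tensor:
    """triples: (subject_idx, predicate_idx, object_idx) token indices."""
    mask = torch.eye(num_tokens, dtype=torch.bool)  # each token attends to itself
    for s, p, o in triples:
        for a, b in [(s, p), (p, o), (s, o)]:
            mask[a, b] = mask[b, a] = True          # allow attention along graph edges
    return mask

def masked_attention(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """x: (num_tokens, dim) text embeddings; mask: (num_tokens, num_tokens) bool."""
    scores = x @ x.T / x.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ x
```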
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is to augment the training set with synthetic images generated by T2I models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
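The inter-class translation behind Diff-Mix, described above, can be approximated with an off-the-shelf image-to-image diffusion pipeline: push a source-class image toward a target class and mix the labels accordingly. The prompt template and soft-label rule below are assumptions for illustration, not the paper's exact procedure.

```python
# Illustrative inter-class image mixup with a diffusion model (a sketch in the
# spirit of Diff-Mix, not its implementation).
import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def diff_mix(src_image, src_label, tgt_class_name, tgt_label, strength=0.6):
    """Translate src_image toward tgt_class_name; the translation strength
    also sets the soft-label mix (an assumption, not the paper's exact rule)."""
    out = pipe(
        prompt=f"a photo of a {tgt_class_name}",
        image=src_image,
        strength=strength,        # how far to push toward the target class
        guidance_scale=7.5,
    ).images[0]
    # Soft label interpolates between the source and target one-hot labels.
    mixed_label = (1 - strength) * src_label + strength * tgt_label
    return out, mixed_label
```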
- SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data [73.23388142296535]
SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets.
We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks.
We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data.
arXiv Detail & Related papers (2024-03-11T17:35:33Z)
- Generating Faithful Text From a Knowledge Graph with Noisy Reference Text [26.6775578332187]
We develop a KG-to-text generation model that can generate faithful natural-language text from a given graph.
Our framework incorporates two core ideas: Firstly, we utilize contrastive learning to enhance the model's ability to differentiate between faithful and hallucinated information in the text.
Secondly, we empower the decoder to control the level of hallucination in the generated text by employing a controllable text generation technique.
arXiv Detail & Related papers (2023-08-12T07:12:45Z)
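The contrastive idea in the KG-to-text entry above can be sketched as an InfoNCE-style loss that pulls the encoding of faithful text toward the graph encoding and pushes hallucinated alternatives away; the encoders, temperature, and pairing scheme below are placeholders, not the paper's implementation.

```python
# Illustrative InfoNCE-style faithfulness loss (a sketch, not the paper's code):
# the graph encoding should be most similar to the faithful text encoding and
# less similar to hallucinated (negative) text encodings.
import torch
import torch.nn.functional as F

def faithfulness_contrastive_loss(graph_emb: torch.Tensor,
                                  faithful_emb: torch.Tensor,
                                  hallucinated_embs: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """graph_emb, faithful_emb: (dim,); hallucinated_embs: (num_neg, dim)."""
    pos = F.cosine_similarity(graph_emb, faithful_emb, dim=0) / temperature
    neg = F.cosine_similarity(graph_emb.unsqueeze(0), hallucinated_embs, dim=1) / temperature
    logits = torch.cat([pos.unsqueeze(0), neg])  # positive logit first (class 0)
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```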
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)