ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
- URL: http://arxiv.org/abs/2503.19312v1
- Date: Tue, 25 Mar 2025 03:18:46 GMT
- Title: ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
- Authors: Jiaqi Liao, Zhengyuan Yang, Linjie Li, Dianqi Li, Kevin Lin, Yu Cheng, Lijuan Wang,
- Abstract summary: We study the problem of Text-to-Image In-Context Learning (T2I-ICL)<n>We propose a framework that incorporates a thought process called ImageGen-CoT prior to image generation.<n>We fine-tune MLLMs using this dataset to enhance their contextual reasoning capabilities.
- Score: 89.19449553099747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we study the problem of Text-to-Image In-Context Learning (T2I-ICL). While Unified Multimodal LLMs (MLLMs) have advanced rapidly in recent years, they struggle with contextual reasoning in T2I-ICL scenarios. To address this limitation, we propose a novel framework that incorporates a thought process called ImageGen-CoT prior to image generation. To avoid generating unstructured ineffective reasoning steps, we develop an automatic pipeline to curate a high-quality ImageGen-CoT dataset. We then fine-tune MLLMs using this dataset to enhance their contextual reasoning capabilities. To further enhance performance, we explore test-time scale-up strategies and propose a novel hybrid scaling approach. This approach first generates multiple ImageGen-CoT chains and then produces multiple images for each chain via sampling. Extensive experiments demonstrate the effectiveness of our proposed method. Notably, fine-tuning with the ImageGen-CoT dataset leads to a substantial 80\% performance gain for SEED-X on T2I-ICL tasks. See our project page at https://ImageGen-CoT.github.io/. Code and model weights will be open-sourced.
Related papers
- GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis [10.47359822447001]
We present an alternate paradigm for T2I synthesis, decomposing the task of complex multi-step generation into three steps.
Our approach derives its strength from the fact that it is modular in nature, is training free, and can be applied over any combination of image generation and editing models.
arXiv Detail & Related papers (2024-12-08T22:29:56Z) - In-Context LoRA for Diffusion Transformers [49.288489286276146]
We show that text-to-image DiTs can effectively perform in-context generation without any tuning.
We name our models In-Context LoRA (IC-LoRA)
Our pipeline generates high-fidelity image sets that better adhere to prompts.
arXiv Detail & Related papers (2024-10-31T09:45:00Z) - VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text (IITC)
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devised a subtask, Image-Text Association (ITA)
arXiv Detail & Related papers (2024-06-14T17:59:40Z) - Multi-modal Generation via Cross-Modal In-Context Learning [50.45304937804883]
We propose a Multi-modal Generation via Cross-Modal In-Context Learning (MGCC) method that generates novel images from complex multimodal prompt sequences.
Our MGCC demonstrates a diverse range of multimodal capabilities, like novel image generation, the facilitation of multimodal dialogue, and generation of texts.
arXiv Detail & Related papers (2024-05-28T15:58:31Z) - L-Verse: Bidirectional Generation Between Image and Text [41.133824156046394]
L-Verse is a novel architecture consisting of feature-augmented variational autoencoder (AugVAE) and bidirectional auto-regressive transformer (BiART)
Our AugVAE shows the state-of-the-art reconstruction performance on ImageNet1K validation set, along with the robustness to unseen images in the wild.
L-Verse can be directly used for image-to-text or text-to-image generation tasks without any finetuning or extra object detection frameworks.
arXiv Detail & Related papers (2021-11-22T11:48:26Z) - Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z) - XGPT: Cross-modal Generative Pre-Training for Image Captioning [80.26456233277435]
XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning.
It is designed to pre-train text-to-image caption generators through three novel generation tasks.
XGPT can be fine-tuned without any task-specific architecture modifications.
arXiv Detail & Related papers (2020-03-03T12:13:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.