T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation
- URL: http://arxiv.org/abs/2507.20536v2
- Date: Tue, 29 Jul 2025 06:16:18 GMT
- Title: T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation
- Authors: Chieh-Yun Chen, Min Shi, Gong Zhang, Humphrey Shi
- Abstract summary: T2I-Copilot is a training-free multi-agent system that automates prompt phrasing, model selection, and iterative refinement. On GenAI-Bench, using open-source generation models, T2I-Copilot achieves a VQA score comparable to commercial models RecraftV3 and Imagen 3, surpasses FLUX1.1-pro by 6.17% at only 16.59% of its cost, and outperforms FLUX.1-dev and SD 3.5 Large by 9.11% and 6.36%.
- Score: 33.57770029533121
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-Image (T2I) generative models have revolutionized content creation but remain highly sensitive to prompt phrasing, often requiring users to refine prompts repeatedly without clear feedback. While techniques such as automatic prompt engineering, controlled text embeddings, denoising, and multi-turn generation mitigate these issues, they offer limited controllability or require additional training, which restricts their generalization ability. Thus, we introduce T2I-Copilot, a training-free multi-agent system that leverages collaboration between (Multimodal) Large Language Models to automate prompt phrasing, model selection, and iterative refinement. This approach significantly simplifies prompt engineering while enhancing generation quality and text-image alignment compared to direct generation. Specifically, T2I-Copilot consists of three agents: (1) Input Interpreter, which parses the input prompt, resolves ambiguities, and generates a standardized report; (2) Generation Engine, which selects the appropriate model from different types of T2I models and organizes visual and textual prompts to initiate generation; and (3) Quality Evaluator, which assesses aesthetic quality and text-image alignment, providing scores and feedback for potential regeneration. T2I-Copilot can operate fully autonomously while also supporting human-in-the-loop intervention for fine-grained control. On GenAI-Bench, using open-source generation models, T2I-Copilot achieves a VQA score comparable to commercial models RecraftV3 and Imagen 3, surpasses FLUX1.1-pro by 6.17% at only 16.59% of its cost, and outperforms FLUX.1-dev and SD 3.5 Large by 9.11% and 6.36%, respectively. Code will be released at: https://github.com/SHI-Labs/T2I-Copilot.
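For concreteness, the sketch below shows how the three-agent loop described in the abstract could be wired together. It is a minimal, hypothetical illustration: the function names, callable interfaces (chat, model pool, critique), and the scoring threshold are all assumptions of this sketch, not the authors' released API (see the GitHub link above for the official code).

```python
# Minimal sketch of a training-free three-agent T2I loop, assuming generic
# callables for the (M)LLM, model selection, generation, and evaluation.
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

@dataclass
class Report:
    refined_prompt: str          # disambiguated, standardized prompt
    notes: str = ""              # extra guidance for the generator

def input_interpreter(user_prompt: str, chat: Callable[[str], str]) -> Report:
    """Agent 1: parse the prompt, resolve ambiguities, emit a standardized report."""
    refined = chat(f"Rewrite this text-to-image prompt unambiguously: {user_prompt}")
    return Report(refined_prompt=refined)

def generation_engine(report: Report,
                      models: Dict[str, Callable[[str], Any]],
                      pick: Callable[[Report, Dict], str]) -> Tuple[Any, str]:
    """Agent 2: choose a T2I model from the pool and run generation."""
    name = pick(report, models)
    return models[name](report.refined_prompt), name

def quality_evaluator(image: Any, report: Report,
                      critique: Callable[[Any, str], Tuple[float, str]]) -> Tuple[float, str]:
    """Agent 3: score aesthetics and text-image alignment, return score + feedback."""
    return critique(image, report.refined_prompt)

def t2i_copilot(user_prompt: str, chat, models, pick, critique,
                max_rounds: int = 3, threshold: float = 0.8):
    """Autonomous loop; a human could also edit report.refined_prompt each round."""
    report = input_interpreter(user_prompt, chat)
    image = None
    for _ in range(max_rounds):
        image, _ = generation_engine(report, models, pick)
        score, feedback = quality_evaluator(image, report, critique)
        if score >= threshold:
            break
        # Fold the evaluator's feedback into the next prompt revision.
        report.refined_prompt = chat(
            f"Revise this prompt given the feedback '{feedback}':\n{report.refined_prompt}"
        )
    return image
```

In this reading, the loop terminates either when the evaluator's score clears the threshold or after a fixed number of regeneration rounds; the human-in-the-loop mode would simply expose the intermediate report and feedback for manual editing between rounds.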
Related papers
- Multi-modal Knowledge Distillation-based Human Trajectory Forecasting [35.060041571520024]
Pedestrian trajectory forecasting is crucial in various applications such as autonomous driving and mobile robot navigation. In such applications, camera-based perception enables the extraction of additional modalities (human pose, text) to enhance prediction accuracy. We propose a multi-modal knowledge distillation framework: a student model with limited modalities is distilled from a teacher model trained with the full range of modalities.
arXiv Detail & Related papers (2025-03-28T07:32:51Z) - X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation [7.61087111021017]
We propose the X2I framework, which endows Diffusion Transformer (DiT) models with the capability to comprehend various modalities. X2I incurs a performance degradation of less than 1% while gaining various multimodal understanding abilities.
arXiv Detail & Related papers (2025-03-08T09:07:45Z) - IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models [52.73820275861131]
Text-to-image (T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation. Recent models such as FLUX.1 and Ideogram2.0 have demonstrated exceptional performance across various complex tasks. This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability.
arXiv Detail & Related papers (2025-01-23T18:58:33Z) - ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting [18.002126814513417]
ChatGen-Evo is a multi-stage evolution strategy that progressively equips models with essential automation skills.
ChatGen-Evo significantly enhances performance over various baselines.
arXiv Detail & Related papers (2024-11-26T07:31:12Z) - Kandinsky 3: Text-to-Image Synthesis for Multifunctional Generative Framework [3.7953598825170753]
Kandinsky 3 is a novel T2I model based on latent diffusion, achieving a high level of quality and photorealism.
We extend the base T2I model for various applications and create a multifunctional generation system.
Human evaluations show that Kandinsky 3 demonstrates one of the highest quality scores among open source generation systems.
arXiv Detail & Related papers (2024-10-28T14:22:08Z) - Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation [149.96612254604986]
PRISM is an algorithm that automatically produces human-interpretable and transferable prompts. Inspired by large language model (LLM) jailbreaking, PRISM leverages the in-context learning ability of LLMs to iteratively refine the candidate prompt distribution. Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles, and images across multiple T2I models.
arXiv Detail & Related papers (2024-03-28T02:35:53Z) - SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data [73.23388142296535]
SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets.
We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks.
We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data.
arXiv Detail & Related papers (2024-03-11T17:35:33Z) - Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation [115.63085345822175]
We introduce "Idea to Image", a system that enables multimodal iterative self-refinement with GPT-4V(ision) for automatic image design and generation.
We investigate if systems based on large multimodal models (LMMs) can develop analogous multimodal self-refinement abilities.
arXiv Detail & Related papers (2023-10-12T17:34:20Z) - GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation [143.81719619351335]
Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation using user-provided captions.
The tight coupling between the current text encoder and image decoder in T2I models makes it challenging to replace or upgrade.
We propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model.
arXiv Detail & Related papers (2023-03-17T15:37:07Z) - GenNI: Human-AI Collaboration for Data-Backed Text Generation [102.08127062293111]
Table2Text systems generate textual output from structured data using machine learning.
GenNI (Generation Negotiation Interface) is an interactive visual system for high-level human-AI collaboration in producing descriptive text.
arXiv Detail & Related papers (2021-10-19T18:07:07Z)