Collaborative Text-to-Image Generation via Multi-Agent Reinforcement Learning and Semantic Fusion
- URL: http://arxiv.org/abs/2510.10633v1
- Date: Sun, 12 Oct 2025 14:29:32 GMT
- Title: Collaborative Text-to-Image Generation via Multi-Agent Reinforcement Learning and Semantic Fusion
- Authors: Jiabao Shi, Minfeng Qi, Lefeng Zhang, Di Wang, Yingjie Zhao, Ziying Li, Yalong Xing, Ningran Li
- Abstract summary: Multimodal text-to-image generation remains constrained by the difficulty of maintaining semantic alignment and professional-level detail. We propose a multi-agent reinforcement learning framework that coordinates domain-specialized agents within two coupled subsystems. Agents are trained using Proximal Policy Optimization (PPO) under a composite reward function that balances semantic similarity, linguistic visual quality, and content diversity.
- Score: 5.999912771209971
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal text-to-image generation remains constrained by the difficulty of maintaining semantic alignment and professional-level detail across diverse visual domains. We propose a multi-agent reinforcement learning framework that coordinates domain-specialized agents (e.g., focused on architecture, portraiture, and landscape imagery) within two coupled subsystems: a text enhancement module and an image generation module, each augmented with multimodal integration components. Agents are trained using Proximal Policy Optimization (PPO) under a composite reward function that balances semantic similarity, linguistic visual quality, and content diversity. Cross-modal alignment is enforced through contrastive learning, bidirectional attention, and iterative feedback between text and image. Across six experimental settings, our system significantly enriches generated content (word count increased by 1614%) while reducing ROUGE-1 scores by 69.7%. Among fusion methods, Transformer-based strategies achieve the highest composite score (0.521), despite occasional stability issues. Multimodal ensembles yield moderate consistency (ranging from 0.444 to 0.481), reflecting the persistent challenges of cross-modal semantic grounding. These findings underscore the promise of collaborative, specialization-driven architectures for advancing reliable multimodal generative systems.
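The abstract names two trainable pieces: a PPO reward that balances semantic similarity, linguistic visual quality, and content diversity, and a contrastive objective for cross-modal alignment. Below is a minimal PyTorch sketch of how such signals are commonly combined; the weights, function names, and the choices of cosine similarity and symmetric InfoNCE are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def composite_reward(text_emb, image_emb, quality, diversity,
                     w_sem=0.5, w_qual=0.3, w_div=0.2):
    """Hypothetical composite PPO reward; weights are assumptions, not paper values."""
    # Semantic similarity between text and image embeddings, rescaled from [-1, 1] to [0, 1].
    sem = (F.cosine_similarity(text_emb, image_emb, dim=-1) + 1.0) / 2.0
    # Weighted combination of the three reward terms the abstract describes.
    return w_sem * sem + w_qual * quality + w_div * diversity

def contrastive_alignment_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings -- a standard form of
    the contrastive cross-modal alignment the abstract mentions (assumed, not
    the paper's exact objective)."""
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = t @ v.T / temperature                        # (B, B) similarity matrix
    targets = torch.arange(t.size(0), device=t.device)    # matching pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```

In a PPO training loop, the scalar from composite_reward would stand in for the environment reward when computing advantages for each domain-specialized agent.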
Related papers
- Unified Text-Image Generation with Weakness-Targeted Post-Training [57.956648078400775]
Unified multimodal generation architectures that jointly produce text and images have recently emerged as a promising direction for text-to-image (T2I) synthesis. This work explores post-training to achieve fully unified text-image generation, where models autonomously transition from textual reasoning to visual synthesis.
arXiv Detail & Related papers (2026-01-07T19:19:44Z) - Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z) - UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception [54.53657134205492]
UniAlignment is a unified multimodal generation framework within a single diffusion transformer. It incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness. We present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions.
arXiv Detail & Related papers (2025-09-28T09:11:30Z) - AdaptaGen: Domain-Specific Image Generation through Hierarchical Semantic Optimization Framework [0.0]
Domain-specific image generation aims to produce high-quality visual content for specialized fields. Current approaches overlook the inherent dependence between semantic understanding and visual representation in specialized domains. We propose AdaptaGen, a hierarchical semantic optimization framework that integrates matrix-based prompt optimization with multi-perspective understanding.
arXiv Detail & Related papers (2025-07-08T03:04:08Z) - Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance. Experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z) - MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation [15.644911934279309]
Diffusion models have shown excellent performance in text-to-image generation. We propose MCCD, a multi-agent collaboration-based compositional diffusion method for text-to-image generation in complex scenes.
arXiv Detail & Related papers (2025-05-05T13:50:03Z) - Generating Multimodal Images with GAN: Integrating Text, Image, and Style [7.481665175881685]
We propose a multimodal image generation method based on Generative Adversarial Networks (GANs). This method involves the design of a text encoder, an image feature extractor, and a style integration module. Experimental results show that our method produces images with high clarity and consistency across multiple public datasets.
arXiv Detail & Related papers (2025-01-04T02:51:28Z) - Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment [53.45813302866466]
We present ISG, a comprehensive evaluation framework for interleaved text-and-image generation. ISG evaluates responses on four levels of granularity: holistic, structural, block-level, and image-specific. In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8 categories and 21 subcategories.
arXiv Detail & Related papers (2024-11-26T07:55:57Z) - Advanced Multimodal Deep Learning Architecture for Image-Text Matching [33.8315200009152]
Image-text matching is a key multimodal task that aims to model the semantic association between images and text as a matching relationship.
We introduce an advanced multimodal deep learning architecture, which combines the high-level abstract representation ability of deep neural networks for visual information with the advantages of natural language processing models for text semantic understanding.
Experiments show that, compared with existing image-text matching models, the optimized model achieves significantly improved performance on a series of benchmark datasets.
arXiv Detail & Related papers (2024-06-13T08:32:24Z) - High-Quality Pluralistic Image Completion via Code Shared VQGAN [51.7805154545948]
We present a novel framework for pluralistic image completion that achieves both high quality and diversity at a much faster inference speed.
Our framework is able to learn semantically-rich discrete codes efficiently and robustly, resulting in much better image reconstruction quality.
arXiv Detail & Related papers (2022-04-05T01:47:35Z)