GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning
- URL: http://arxiv.org/abs/2601.18543v2
- Date: Wed, 28 Jan 2026 08:36:07 GMT
- Title: GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning
- Authors: Kaixun Jiang, Yuzheng Wang, Junjie Zhou, Pandeng Li, Zhihang Liu, Chen-Wei Xie, Zhaoyu Chen, Yun Zheng, Wenqiang Zhang
- Abstract summary: We introduce GenAgent, unifying visual understanding and generation through an agentic multimodal model. GenAgent significantly boosts base generator (FLUX.1-dev) performance on GenEval++ and WISE. Our framework demonstrates three key properties: 1) cross-tool generalization to generators with varying capabilities, 2) test-time scaling with consistent improvements across interaction rounds, and 3) task-adaptive reasoning that automatically adjusts to different tasks.
- Score: 54.42973725693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce GenAgent, unifying visual understanding and generation through an agentic multimodal model. Unlike unified models that face expensive training costs and understanding-generation trade-offs, GenAgent decouples these capabilities through an agentic framework: understanding is handled by the multimodal model itself, while generation is achieved by treating image generation models as invokable tools. Crucially, unlike existing modular systems constrained by static pipelines, this design enables autonomous multi-turn interactions where the agent generates multimodal chains-of-thought encompassing reasoning, tool invocation, judgment, and reflection to iteratively refine outputs. We employ a two-stage training strategy: first, cold-start with supervised fine-tuning on high-quality tool invocation and reflection data to bootstrap agent behaviors; second, end-to-end agentic reinforcement learning combining pointwise rewards (final image quality) and pairwise rewards (reflection accuracy), with trajectory resampling for enhanced multi-turn exploration. GenAgent significantly boosts base generator (FLUX.1-dev) performance on GenEval++ (+23.6%) and WISE (+14%). Beyond performance gains, our framework demonstrates three key properties: 1) cross-tool generalization to generators with varying capabilities, 2) test-time scaling with consistent improvements across interaction rounds, and 3) task-adaptive reasoning that automatically adjusts to different tasks. Our code will be available at https://github.com/deep-kaixun/GenAgent.
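The multi-turn interaction described in the abstract (reasoning, tool invocation, judgment, reflection) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `run_agent`, `Turn`, `generate`, `judge`, and `revise` are hypothetical, images are stood in for by strings, and the toy callables below replace the multimodal model and the image generator purely so the loop can run.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names here are hypothetical illustrations, not the GenAgent codebase.
Image = str  # stand-in for real image data


@dataclass
class Turn:
    prompt: str       # prompt sent to the generator tool in this round
    image: Image      # what the tool returned
    accepted: bool    # the agent's judgment of the result
    reflection: str   # feedback used to revise the next prompt


def run_agent(
    user_prompt: str,
    generate: Callable[[str], Image],
    judge: Callable[[str, Image], Tuple[bool, str]],
    revise: Callable[[str, str], str],
    max_rounds: int = 3,
) -> List[Turn]:
    """Invoke the generator tool, judge the output, and reflect until accepted."""
    trajectory: List[Turn] = []
    prompt = user_prompt
    for _ in range(max_rounds):
        image = generate(prompt)                          # tool invocation
        accepted, reflection = judge(user_prompt, image)  # judgment
        trajectory.append(Turn(prompt, image, accepted, reflection))
        if accepted:
            break
        prompt = revise(prompt, reflection)               # reflection feeds the next round
    return trajectory


# Toy stand-ins (assumptions) so the loop can be exercised end to end.
def toy_generate(prompt: str) -> Image:
    return f"image<{prompt}>"


def toy_judge(request: str, image: Image) -> Tuple[bool, str]:
    ok = "two cats" in image
    return ok, "" if ok else "the image should show two cats"


def toy_revise(prompt: str, reflection: str) -> str:
    return f"{prompt}; {reflection}"


turns = run_agent("cats on a sofa", toy_generate, toy_judge, toy_revise)
for t in turns:
    print(t.accepted, t.prompt)
```

Because the generator is passed in as a plain callable, swapping in a different tool does not change the loop, which is one way to picture the cross-tool generalization the abstract mentions. `max_rounds` plays the role of the interaction-round budget.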
Related papers
- AgentArk: Distilling Multi-Agent Intelligence into a Single LLM Agent [57.10083973844841]
AgentArk is a novel framework to distill multi-agent dynamics into the weights of a single model. We investigate three hierarchical distillation strategies across various models, tasks, scaling, and scenarios. By shifting the burden of computation from inference to training, the distilled models preserve the efficiency of one agent while exhibiting strong reasoning and self-correction performance of multiple agents.
arXiv Detail & Related papers (2026-02-03T19:18:28Z)
- Agent2World: Learning to Generate Symbolic World Models via Adaptive Multi-Agent Feedback [51.22403664895878]
Agent2World is a tool-augmented multi-agent framework that achieves strong inference-time world-model generation. It also serves as a data engine for supervised fine-tuning, by grounding generation in multi-agent feedback.
arXiv Detail & Related papers (2025-12-26T18:54:14Z)
- ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation [49.01601313084479]
ImAgent is a training-free unified multimodal agent that integrates reasoning, generation, and self-evaluation. Experiments on image generation and editing tasks demonstrate that ImAgent consistently improves over the backbone.
arXiv Detail & Related papers (2025-11-14T17:00:29Z)
- Klear-AgentForge: Forging Agentic Intelligence through Posttraining Scaling [46.593200463657645]
We present a comprehensive and fully open-source pipeline for training a high-performance agentic model, named Klear-Qwen3-AgentForge. We design effective supervised fine-tuning (SFT) with synthetic data followed by multi-turn reinforcement learning (RL) to unlock the potential for multiple diverse agentic tasks.
arXiv Detail & Related papers (2025-11-08T09:47:27Z)
- Hollywood Town: Long-Video Generation via Cross-Modal Multi-Agent Orchestration [73.65102758687289]
This study introduces three innovations to improve multi-agent collaboration. First, we propose OmniAgent, a hierarchical, graph-based multi-agent framework for long video generation. Second, inspired by context engineering, we propose hypergraph nodes that enable temporary group discussions.
arXiv Detail & Related papers (2025-10-25T20:34:18Z)
- Agent Lightning: Train ANY AI Agents with Reinforcement Learning [24.13422767414729]
We present Agent Lightning, a framework that enables Reinforcement Learning (RL)-based training of Large Language Models (LLMs) for any AI agent. By formulating agent execution as a Markov decision process, we define a unified data interface and propose a hierarchical RL algorithm, LightningRL, which contains a credit assignment module. For the system design, we introduce a Training-Agent Disaggregation architecture and bring agent observability frameworks into the agent runtime.
arXiv Detail & Related papers (2025-08-05T17:50:13Z)
- ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems [80.69865295743149]
This work attempts to study using LLM-based agents to design collaborative AI systems autonomously. Based on ComfyBench, we develop ComfyAgent, a framework that empowers agents to autonomously design collaborative AI systems by generating. While ComfyAgent achieves a comparable resolve rate to o1-preview and significantly surpasses other agents on ComfyBench, ComfyAgent has resolved only 15% of creative tasks.
arXiv Detail & Related papers (2024-09-02T17:44:10Z)
- Learning Generative Models with Goal-conditioned Reinforcement Learning [0.0]
We present a novel framework for learning generative models with goal-conditioned reinforcement learning. We empirically demonstrate that our method is able to generate diverse and high-quality samples in the task of image synthesis.
arXiv Detail & Related papers (2023-03-26T20:33:44Z)
- UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers [108.92194081987967]
We make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing one single architecture to fit tasks. Unlike previous RNN-based models, we utilize a transformer-based model to generate a flexible policy. The proposed model, named Universal Policy Decoupling Transformer (UPDeT), further relaxes the action restriction and makes the multi-agent task's decision process more explainable.
arXiv Detail & Related papers (2021-01-20T07:24:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.