UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation
- URL: http://arxiv.org/abs/2505.14682v1
- Date: Tue, 20 May 2025 17:59:26 GMT
- Title: UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation
- Authors: Rui Tian, Mingfei Gao, Mingze Xu, Jiaming Hu, Jiasen Lu, Zuxuan Wu, Yinfei Yang, Afshin Dehghan
- Abstract summary: We introduce UniGen, a unified multimodal large language model (MLLM) capable of image understanding and generation. We study the full training pipeline of UniGen from a data-centric perspective, including multi-stage pre-training, supervised fine-tuning, and direct preference optimization. We propose a new Chain-of-Thought Verification (CoT-V) strategy for test-time scaling, which significantly boosts UniGen's image generation quality.
- Score: 52.12029029338604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce UniGen, a unified multimodal large language model (MLLM) capable of image understanding and generation. We study the full training pipeline of UniGen from a data-centric perspective, including multi-stage pre-training, supervised fine-tuning, and direct preference optimization. More importantly, we propose a new Chain-of-Thought Verification (CoT-V) strategy for test-time scaling, which significantly boosts UniGen's image generation quality using a simple Best-of-N test-time strategy. Specifically, CoT-V enables UniGen to act as both image generator and verifier at test time, assessing the semantic alignment between a text prompt and its generated image in a step-by-step CoT manner. Trained entirely on open-source datasets across all stages, UniGen achieves state-of-the-art performance on a range of image understanding and generation benchmarks, with a final score of 0.78 on GenEval and 85.19 on DPG-Bench. Through extensive ablation studies, our work provides actionable insights and addresses key challenges in the full life cycle of building unified MLLMs, contributing meaningful directions to future research.
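The Best-of-N with CoT verification recipe is simple to state in code. Below is a minimal sketch, assuming a hypothetical `model` object whose `generate_image` and `cot_verify` methods stand in for the unified model's generator and verifier roles; neither name comes from the paper:

```python
# Minimal sketch of Best-of-N with Chain-of-Thought verification.
# `generate_image` and `cot_verify` are hypothetical stand-ins for the
# unified model's generator and verifier roles.

def best_of_n_with_cot_verification(model, prompt: str, n: int = 8):
    """Sample N candidates, then let the same model verify each one."""
    candidates = [model.generate_image(prompt) for _ in range(n)]
    scored = []
    for image in candidates:
        # The verifier reasons step by step about prompt-image alignment
        # (objects, counts, attributes, relations) and emits a score.
        _cot_trace, score = model.cot_verify(prompt, image)
        scored.append((score, image))
    # Return the candidate judged best aligned with the prompt.
    _best_score, best_image = max(scored, key=lambda pair: pair[0])
    return best_image
```

The key point of the abstract is that the same MLLM plays both roles; no external verifier or reward model is involved.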
Related papers
- Reinforcing Multimodal Understanding and Generation with Dual Self-rewards [56.08202047680044]
Large multimodal models (LMMs) unify cross-modal understanding and generation into a single framework. Current solutions require external supervision (e.g., human feedback or reward models) and only address unidirectional tasks. We introduce a self-supervised dual reward mechanism to reinforce the understanding and generation capabilities of LMMs (a minimal sketch follows this entry).
arXiv Detail & Related papers (2025-06-09T17:38:45Z)
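A hedged sketch of the dual self-reward idea, assuming a single model exposing both a generation and an understanding path; the helpers `generate_image`, `describe_image`, and `similarity` are illustrative names, not the paper's API:

```python
# Illustrative dual self-reward loop: the model's own understanding
# path scores its generations, and that score serves as the training
# reward. `generate_image`, `describe_image`, and `similarity` are
# hypothetical method names.

def dual_self_reward(model, prompt: str) -> float:
    """Generate from a prompt, then self-evaluate via understanding."""
    image = model.generate_image(prompt)    # generation direction
    caption = model.describe_image(image)   # understanding direction
    # Agreement between prompt and self-caption is the reward; no
    # external reward model or human feedback is required.
    return model.similarity(prompt, caption)
```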
- Policy Optimized Text-to-Image Pipeline Design [72.87655664038617]
We introduce a novel reinforcement learning-based framework for text-to-image generation. Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations. We then implement a two-phase training strategy: initial vocabulary training followed by GRPO-based optimization.
arXiv Detail & Related papers (2025-05-27T17:50:47Z)
- UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning [24.792798238358717]
UGen is a unified autoregressive multimodal model that demonstrates strong performance across text processing, image understanding, and image generation tasks simultaneously. To address the challenges associated with unified multimodal learning, UGen is trained using a novel mechanism, namely progressive vocabulary learning (sketched after this entry).
arXiv Detail & Related papers (2025-03-27T06:19:29Z)
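A rough sketch of what progressive vocabulary learning could look like in practice; the vocabulary sizes and the linear unlocking schedule below are illustrative assumptions, not UGen's actual curriculum:

```python
# Rough sketch of progressive vocabulary learning: image tokens are
# unlocked gradually on top of a fixed text vocabulary. The sizes and
# the linear schedule are illustrative assumptions.

TEXT_VOCAB_SIZE = 32_000   # text tokens, available from step 0
IMAGE_VOCAB_SIZE = 8_192   # visual tokens, unlocked over training

def active_vocab_size(step: int, total_steps: int) -> int:
    """Linearly unlock image tokens as training progresses."""
    frac = min(1.0, step / total_steps)
    return TEXT_VOCAB_SIZE + int(frac * IMAGE_VOCAB_SIZE)

# During training, logits for token ids >= active_vocab_size(step, T)
# would be masked so the model never predicts still-locked tokens.
```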
- ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning [89.19449553099747]
We study the problem of Text-to-Image In-Context Learning (T2I-ICL). We propose a framework that incorporates a thought process called ImageGen-CoT prior to image generation. We fine-tune MLLMs on the resulting ImageGen-CoT dataset to enhance their contextual reasoning capabilities.
arXiv Detail & Related papers (2025-03-25T03:18:46Z)
- HybridGen: VLM-Guided Hybrid Planning for Scalable Data Generation of Imitation Learning [2.677995462843075]
HybridGen is an automated framework that integrates a Vision-Language Model (VLM) with hybrid planning. It generates a large volume of training data without requiring specific data formats. In the most challenging task variants, HybridGen achieves significant improvement, reaching a 59.7% average success rate.
arXiv Detail & Related papers (2025-03-17T13:49:43Z)
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [77.86514804787622]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation (a step-wise verification sketch follows this entry).
arXiv Detail & Related papers (2025-01-23T18:59:43Z)
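A hedged sketch of step-wise verification during autoregressive image generation, in the spirit of PARM: a reward model scores partial token sequences so low-potential candidates can be pruned early. The helpers `draft_step` and `score_partial` are hypothetical:

```python
# Hedged sketch of PARM-style step-wise verification for autoregressive
# image generation. `draft_step` (returning a few continuations of a
# partial sequence) and `score_partial` are hypothetical helpers.

def generate_with_step_verification(model, reward_model, prompt,
                                    steps=16, beam=4, branch=2):
    """Beam-style search that keeps the highest-potential partial images."""
    candidates = [[]]  # start from an empty image-token sequence
    for _ in range(steps):
        expanded = []
        for seq in candidates:
            # Sample a few continuations of each partial sequence.
            expanded.extend(model.draft_step(prompt, seq, num_samples=branch))
        # Assess each partial sequence's potential to become a well-aligned
        # image, then keep only the top `beam` candidates.
        expanded.sort(key=lambda s: reward_model.score_partial(prompt, s),
                      reverse=True)
        candidates = expanded[:beam]
    return candidates[0]  # highest-scoring complete sequence
```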
- X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models [77.98981338798383]
In-context generation is a key component of large language models' (LLMs) open-task generalization capability. X-Prompt is a purely auto-regressive large vision-language model designed to deliver competitive performance across a wide range of both seen and unseen image generation tasks. A unified training task for both text and image prediction enables X-Prompt to handle general image generation with enhanced task awareness from in-context examples.
arXiv Detail & Related papers (2024-12-02T18:59:26Z)
- UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models [88.16197692794707]
UniGen is a comprehensive framework designed to produce diverse, accurate, and highly controllable datasets.
To augment data diversity, UniGen incorporates an attribute-guided generation module and a group checking feature (sketched after this entry).
Extensive experiments demonstrate the superior quality of data generated by UniGen.
arXiv Detail & Related papers (2024-06-27T07:56:44Z)
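A minimal sketch of attribute-guided generation with group checking, as the summary above describes it; the attribute pool and the `llm_generate` callable are illustrative assumptions:

```python
# Minimal sketch of attribute-guided dataset generation with a simple
# group check. ATTRIBUTES and `llm_generate` are illustrative
# assumptions, not UniGen's actual configuration.
import itertools

ATTRIBUTES = {
    "topic": ["physics", "history", "cooking"],
    "difficulty": ["easy", "hard"],
}

def generate_dataset(llm_generate, per_combo: int = 2):
    """Condition each example on a sampled attribute combination."""
    dataset, seen = [], set()
    for topic, difficulty in itertools.product(*ATTRIBUTES.values()):
        group = [llm_generate(f"Write a {difficulty} {topic} question.")
                 for _ in range(per_combo)]
        # Group checking, reduced here to exact-duplicate filtering:
        # keep only items not generated before.
        for item in group:
            if item not in seen:
                seen.add(item)
                dataset.append({"topic": topic,
                                "difficulty": difficulty,
                                "text": item})
    return dataset
```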
- TcGAN: Semantic-Aware and Structure-Preserved GANs with Individual Vision Transformer for Fast Arbitrary One-Shot Image Generation [11.207512995742999]
One-shot image generation (OSG) with generative adversarial networks that learn from the internal patches of a given image has attracted worldwide attention.
We propose TcGAN, a novel structure-preserved method with an individual vision transformer, to overcome the shortcomings of existing one-shot image generation methods.
arXiv Detail & Related papers (2023-02-16T03:05:59Z)