Unlocking Aha Moments via Reinforcement Learning: Advancing Collaborative Visual Comprehension and Generation
- URL: http://arxiv.org/abs/2506.01480v1
- Date: Mon, 02 Jun 2025 09:39:28 GMT
- Title: Unlocking Aha Moments via Reinforcement Learning: Advancing Collaborative Visual Comprehension and Generation
- Authors: Kaihang Pan, Yang Wu, Wendong Bu, Kai Shen, Juncheng Li, Yingting Wang, Yunfei Li, Siliang Tang, Jun Xiao, Fei Wu, Hang Zhao, Yueting Zhuang
- Abstract summary: We propose to enable the collaborative co-evolution of visual comprehension and generation. We introduce a two-stage training approach: supervised fine-tuning teaches the MLLM the foundational ability to generate genuine CoT. We unlock the Aha moment in visual generation, advancing MLLMs from text-to-image tasks to unified image generation.
- Score: 85.22602924467603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. However, these two capabilities remain largely independent, as if they are two separate functions encapsulated within the same model. Consequently, visual comprehension does not enhance visual generation, and the reasoning mechanisms of LLMs have not been fully integrated to revolutionize image generation. In this paper, we propose to enable the collaborative co-evolution of visual comprehension and generation, advancing image generation into an iterative introspective process. We introduce a two-stage training approach: supervised fine-tuning teaches the MLLM the foundational ability to generate genuine CoT for visual generation, while reinforcement learning activates its full potential via an exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in visual generation, advancing MLLMs from text-to-image tasks to unified image generation. Extensive experiments demonstrate that our model not only excels in text-to-image generation and image editing, but also functions as a superior image semantic evaluator with enhanced visual comprehension capabilities. Project Page: https://janus-pro-r1.github.io.
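The abstract describes the two-stage recipe only at a high level. As a rough illustration of what such a pipeline could look like (supervised fine-tuning on CoT-annotated text-to-image data, followed by a group-sampled reinforcement step that uses the model's own comprehension as the reward), here is a minimal, self-contained Python sketch. The class and method names (`UnifiedMLLM`, `supervised_update`, `reinforce_update`) and the group-baseline reward scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a two-stage SFT + RL recipe for CoT-driven image generation.
# All components are stubs; real training would back them with an MLLM and optimizer.
import random


class UnifiedMLLM:
    """Stand-in for an MLLM that can both generate and score images."""

    def generate_with_cot(self, prompt):
        # Returns (chain_of_thought, image_tokens); stubbed with random tokens.
        cot = f"reasoning about: {prompt}"
        image_tokens = [random.randint(0, 1023) for _ in range(16)]
        return cot, image_tokens

    def score(self, prompt, image_tokens):
        # Uses the model's own visual comprehension as a semantic reward (stubbed).
        return random.random()

    def supervised_update(self, prompt, target_cot, target_image_tokens):
        pass  # cross-entropy on gold CoT + image tokens would go here

    def reinforce_update(self, prompt, cot, image_tokens, advantage):
        pass  # policy-gradient step weighted by the advantage would go here


def train(model, sft_data, rl_prompts, group_size=4):
    # Stage 1: supervised fine-tuning teaches the model to emit genuine CoT.
    for prompt, cot, image_tokens in sft_data:
        model.supervised_update(prompt, cot, image_tokens)

    # Stage 2: reinforcement learning with an exploration-exploitation trade-off:
    # sample a group of rollouts per prompt, reinforce those above the group mean.
    for prompt in rl_prompts:
        rollouts = [model.generate_with_cot(prompt) for _ in range(group_size)]
        rewards = [model.score(prompt, img) for _, img in rollouts]
        baseline = sum(rewards) / len(rewards)
        for (cot, img), r in zip(rollouts, rewards):
            model.reinforce_update(prompt, cot, img, advantage=r - baseline)


if __name__ == "__main__":
    demo_sft = [("a red cube on a table", "plan the layout first", [1, 2, 3])]
    train(UnifiedMLLM(), demo_sft, rl_prompts=["a cat wearing sunglasses"])
```

In this reading, using the model's own comprehension head as the reward signal is what ties generation and comprehension together, consistent with the abstract's claim of collaborative co-evolution; the exact reward design in the paper may differ.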
Related papers
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens [52.21981295470491]
We present UniFluid, a unified autoregressive framework for joint visual generation and understanding. Our unified autoregressive architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for images. We find that, although there is an inherent trade-off between the image generation and understanding tasks, a carefully tuned training recipe enables them to improve each other.
arXiv Detail & Related papers (2025-03-17T17:58:30Z) - ARMOR: Empowering Multimodal Understanding Model with Interleaved Multimodal Generation Capability [14.703591553247948]
ARMOR is a resource-efficient, purely autoregressive framework for multimodal large language models. It achieves both understanding and generation by fine-tuning existing MLLMs. We show that ARMOR upgrades existing MLLMs to UniMs with promising image generation capabilities, using limited training resources.
arXiv Detail & Related papers (2025-03-09T10:15:39Z) - VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model [38.61292051733335]
We present VARGPT, a novel multimodal large language model (MLLM) that unifies visual understanding and generation within a single autoregressive framework. VARGPT employs a next-token prediction paradigm for visual understanding and a next-scale prediction paradigm for visual autoregressive generation. Notably, VARGPT naturally supports autoregressive visual generation and instruction-to-image synthesis, showcasing its versatility in both visual understanding and generation tasks.
arXiv Detail & Related papers (2025-01-21T17:50:43Z) - MetaMorph: Multimodal Understanding and Generation via Instruction Tuning [57.35160715164359]
Visual-Predictive Instruction Tuning (VPiT) is a simple and effective extension to visual instruction tuning. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data. We train our MetaMorph model and achieve competitive performance on both visual understanding and generation.
arXiv Detail & Related papers (2024-12-18T18:58:50Z) - X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models [77.98981338798383]
In-context generation is a key component of large language models' (LLMs) open-task generalization capability. X-Prompt is a purely auto-regressive large vision-language model designed to deliver competitive performance across a wide range of both seen and unseen image generation tasks. A unified training task for both text and image prediction enables X-Prompt to handle general image generation with enhanced task awareness from in-context examples.
arXiv Detail & Related papers (2024-12-02T18:59:26Z) - TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z) - X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs [49.30255148577368]
X-Former is a lightweight transformer module designed to exploit the complementary strengths of CL and MIM.
X-Former first bootstraps vision-language representation learning and multimodal-to-multimodal generative learning from two frozen vision encoders.
It further bootstraps vision-to-language generative learning from a frozen LLM to ensure visual features from X-Former can be interpreted by the LLM.
arXiv Detail & Related papers (2024-07-18T18:39:54Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)