OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning
- URL: http://arxiv.org/abs/2508.21066v1
- Date: Thu, 28 Aug 2025 17:59:46 GMT
- Title: OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning
- Authors: Yuan Gong, Xionghui Wang, Jie Wu, Shiyin Wang, Yitong Wang, Xinglong Wu
- Abstract summary: OneReward is a unified reinforcement learning framework that enhances the model's generative capabilities across multiple tasks. We develop Seedream 3.0 Fill, a mask-guided generation model trained via multi-task reinforcement learning.
- Score: 26.133555631867385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce OneReward, a unified reinforcement learning framework that enhances a model's generative capabilities across multiple tasks under different evaluation criteria using only one reward model. By employing a single vision-language model (VLM) as the generative reward model, which can distinguish the winner from the loser for a given task and a given evaluation criterion, our framework can be effectively applied to multi-task generation models, particularly in contexts with varied data and diverse task objectives. We apply OneReward to mask-guided image generation, which can be further divided into several sub-tasks such as image fill, image extend, object removal, and text rendering, each taking a binary mask as the edit area. Although these domain-specific tasks share the same conditioning paradigm, they differ significantly in underlying data distributions and evaluation metrics. Existing methods often rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. Building on OneReward, we develop Seedream 3.0 Fill, a mask-guided generation model trained via multi-task reinforcement learning directly on a pre-trained base model, eliminating the need for task-specific SFT. Experimental results demonstrate that our unified edit model consistently outperforms both commercial and open-source competitors, such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro], across multiple evaluation dimensions. Code and model are available at: https://one-reward.github.io
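The core idea in the abstract is that a single reward model judges winner vs. loser for any (task, criterion) pair, and that judgment drives the RL signal across all sub-tasks. A minimal sketch of this pairwise-reward interface is below; the VLM call is stubbed with a toy scorer, and all names (`vlm_judge`, `pairwise_reward`, the feature keys) are illustrative assumptions, not the authors' actual API.

```python
# Hedged sketch of a unified pairwise reward: one judge serves every
# task and evaluation criterion, returning +1 for a win and -1 for a loss.
from dataclasses import dataclass, field


@dataclass
class Sample:
    image_id: str
    # Toy stand-in for a generated image: per-criterion quality scores.
    features: dict = field(default_factory=dict)


def vlm_judge(task: str, criterion: str, a: Sample, b: Sample) -> int:
    """Stub for the single VLM reward model: +1 if `a` beats `b`
    on `criterion` for `task`, else -1. A real system would prompt a
    vision-language model with both images; here we compare toy scores."""
    def score(s: Sample) -> float:
        return s.features.get(criterion, 0.0)
    return 1 if score(a) >= score(b) else -1


def pairwise_reward(task: str, criterion: str,
                    candidate: Sample, reference: Sample) -> int:
    """Reward used by the RL step: the same judge is reused across
    image fill, extend, removal, and text rendering sub-tasks."""
    return vlm_judge(task, criterion, candidate, reference)


cand = Sample("gen", {"consistency": 0.9, "aesthetics": 0.4})
ref = Sample("base", {"consistency": 0.7, "aesthetics": 0.6})
print(pairwise_reward("image_fill", "consistency", cand, ref))  # prints 1
print(pairwise_reward("image_fill", "aesthetics", cand, ref))   # prints -1
```

The point of the sketch is the shape of the interface: because the judge is conditioned on both the task and the criterion, one model can emit rewards for sub-tasks with very different data distributions, which is what lets the paper skip per-task SFT.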
Related papers
- UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing [59.590505989071175]
Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images following user prompts. We introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights.
arXiv Detail & Related papers (2025-03-16T21:11:25Z) - RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models [22.042487298092883]
RealGeneral is a novel framework that reformulates image generation as a conditional frame prediction task. It achieves a 14.5% improvement in subject similarity for customized generation and a 10% enhancement in image quality for the canny-to-image task.
arXiv Detail & Related papers (2025-03-13T14:31:52Z) - EditAR: Unified Conditional Generation with Autoregressive Models [58.093860528672735]
We propose EditAR, a single unified autoregressive framework for a variety of conditional image generation tasks. The model takes both images and instructions as inputs, and predicts the edited image tokens in a vanilla next-token paradigm. We evaluate its effectiveness across diverse tasks on established benchmarks, showing competitive performance against various state-of-the-art task-specific methods.
arXiv Detail & Related papers (2025-01-08T18:59:35Z) - MOWA: Multiple-in-One Image Warping Model [65.73060159073644]
We propose a Multiple-in-One image warping model (named MOWA) in this work. We mitigate the difficulty of multi-task learning by disentangling the motion estimation at both the region level and pixel level. To our knowledge, this is the first work that solves multiple practical warping tasks in one single model.
arXiv Detail & Related papers (2024-04-16T16:50:35Z) - UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight generalization.
arXiv Detail & Related papers (2023-07-30T09:48:36Z) - An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z) - FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose FAME-ViL, a novel fashion-focused multi-task efficient learning method for vision-and-language tasks.
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.