UniRef-Image-Edit: Towards Scalable and Consistent Multi-Reference Image Editing
- URL: http://arxiv.org/abs/2602.14186v1
- Date: Sun, 15 Feb 2026 15:24:03 GMT
- Title: UniRef-Image-Edit: Towards Scalable and Consistent Multi-Reference Image Editing
- Authors: Hongyang Wei, Bin Wen, Yancheng Long, Yankai Yang, Yuhang Hu, Tianke Zhang, Wei Chen, Haonan Fan, Kaiyu Jiang, Jiankang Chen, Changyi Liu, Kaiyu Tang, Haojie Ding, Xiao Yang, Jia Sun, Huaiqing Wang, Zhenyu Yang, Xinyu Wei, Xianglong He, Yangguang Li, Fan Yang, Tingting Gao, Lei Zhang, Guorui Zhou, Han Li
- Abstract summary: We present UniRef-Image-Edit, a high-performance multi-modal generation system. It unifies single-image editing and multi-image composition within a single framework.
- Score: 33.64590153603506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present UniRef-Image-Edit, a high-performance multi-modal generation system that unifies single-image editing and multi-image composition within a single framework. Existing diffusion-based editing methods often struggle to maintain consistency across multiple conditions due to limited interaction between reference inputs. To address this, we introduce Sequence-Extended Latent Fusion (SELF), a unified input representation that dynamically serializes multiple reference images into a coherent latent sequence. During a dedicated training stage, all reference images are jointly constrained to fit within a fixed-length sequence under a global pixel-budget constraint. Building upon SELF, we propose a two-stage training framework comprising supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we jointly train on single-image editing and multi-image composition tasks to establish a robust generative prior. We adopt a progressive sequence length training strategy, in which all input images are initially resized to a total pixel budget of $1024^2$, and are then gradually increased to $1536^2$ and $2048^2$ to improve visual fidelity and cross-reference consistency. This gradual relaxation of compression enables the model to incrementally capture finer visual details while maintaining stable alignment across references. For the RL stage, we introduce Multi-Source GRPO (MSGRPO), to our knowledge the first reinforcement learning framework tailored for multi-reference image generation. MSGRPO optimizes the model to reconcile conflicting visual constraints, significantly enhancing compositional consistency. We will open-source the code, models, training data, and reward data for community research purposes.
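The SELF pixel-budget constraint described in the abstract is concrete enough to sketch: all reference images share one downscale factor so that their combined pixel count stays within the current budget, which the progressive training schedule relaxes from $1024^2$ to $1536^2$ and $2048^2$. The snippet below is a minimal, hypothetical illustration of that resizing rule; the function name, rounding, and schedule loop are assumptions for exposition, not the paper's released implementation.

```python
import math

def fit_to_pixel_budget(sizes, budget=1024 ** 2):
    """Scale reference images so their combined pixel count stays within
    one global budget (the SELF-style constraint described in the abstract).

    sizes  : list of (width, height) for each reference image
    budget : total number of pixels allowed across ALL references
    returns: list of (width, height) after a shared uniform downscale

    Illustrative sketch only: the exact rounding, patchification, and
    latent packing used by UniRef-Image-Edit are not specified here.
    """
    total = sum(w * h for w, h in sizes)
    if total <= budget:
        return list(sizes)                 # already within budget, keep as-is
    scale = math.sqrt(budget / total)      # one shared factor preserves aspect ratios
    return [(max(1, round(w * scale)), max(1, round(h * scale))) for w, h in sizes]

# Progressive pixel-budget schedule from the abstract: 1024^2 -> 1536^2 -> 2048^2.
references = [(1024, 768), (512, 512), (2048, 1536)]
for budget in (1024 ** 2, 1536 ** 2, 2048 ** 2):
    print(budget, fit_to_pixel_budget(references, budget))
```

Under this reading, enlarging the budget in later stages reduces compression, which matches the paper's claim that the model incrementally captures finer visual detail while keeping cross-reference alignment stable.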
Related papers
- Skywork UniPic 3.0: Unified Multi-Image Composition via Sequence Modeling [21.387568749211876]
We present Skywork UniPic 3.0, a unified framework that integrates single-image editing and multi-image composition. To address the challenges of multi-image composition, we design a comprehensive data collection, filtering, and synthesis pipeline. We introduce a novel training paradigm that formulates multi-image composition as a sequence-modeling problem.
arXiv Detail & Related papers (2026-01-22T05:23:20Z) - Towards Generalized Multi-Image Editing for Unified Multimodal Models [56.620038824933566]
Unified Multimodal Models (UMMs) integrate multimodal understanding and generation. However, UMMs remain limited in maintaining visual consistency and disambiguating visual cues when referencing details across multiple input images. We propose a scalable multi-image editing framework for UMMs that explicitly distinguishes image identities and generalizes to variable input counts.
arXiv Detail & Related papers (2026-01-09T06:42:49Z) - GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation [77.13582457917418]
We train a generative model solely on grid images comprising subsampled frames. We learn to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames. Our method consistently outperforms the state of the art in quality and inference speed (at least twice as fast) across datasets.
arXiv Detail & Related papers (2025-12-24T16:46:04Z) - Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach [99.80480649258557]
DiTFuse is an instruction-driven framework that performs semantics-aware fusion within a single model. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention.
arXiv Detail & Related papers (2025-12-08T05:04:54Z) - Canvas-to-Image: Compositional Image Generation with Multimodal Controls [51.44122945214702]
We introduce Canvas-to-Image, a unified framework that consolidates heterogeneous controls into a single canvas interface. Our key idea is to encode diverse control signals into a single composite canvas image that the model can interpret for integrated visual-spatial reasoning.
arXiv Detail & Related papers (2025-11-26T18:59:56Z) - Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z) - LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer [32.9330637921386]
LAMIC is a Layout-Aware Multi-Image Composition framework. It extends single-reference diffusion models to multi-reference scenarios in a training-free manner. It consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R, and AVG scores across all settings.
arXiv Detail & Related papers (2025-08-01T09:51:54Z) - MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models [30.494968865008513]
Recent text-to-image models struggle to provide precise visual control and to balance multimodal inputs, and they require extensive training for complex image generation. We propose MENTOR, a novel framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. Our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods.
arXiv Detail & Related papers (2025-07-13T10:52:59Z) - Auto-Regressively Generating Multi-View Consistent Images [10.513203377236744]
We propose the Multi-View Auto-Regressive (MV-AR) method to generate consistent multi-view images from arbitrary prompts. When generating widely separated views, MV-AR can utilize all of its preceding views to extract effective reference information. Experiments demonstrate the performance and versatility of MV-AR, which reliably generates consistent multi-view images.
arXiv Detail & Related papers (2025-06-23T11:28:37Z) - Many-to-many Image Generation with Auto-regressive Diffusion Models [59.5041405824704]
This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images.
We present MIS, a novel large-scale multi-image dataset, containing 12M synthetic multi-image samples, each with 25 interconnected images.
We learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework.
arXiv Detail & Related papers (2024-04-03T23:20:40Z) - LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model [55.20469538848806]
LeftRefill is an innovative approach to efficiently harness large Text-to-Image (T2I) diffusion models for reference-guided image synthesis.
arXiv Detail & Related papers (2023-05-19T10:29:42Z)