Skywork UniPic 3.0: Unified Multi-Image Composition via Sequence Modeling
- URL: http://arxiv.org/abs/2601.15664v1
- Date: Thu, 22 Jan 2026 05:23:20 GMT
- Title: Skywork UniPic 3.0: Unified Multi-Image Composition via Sequence Modeling
- Authors: Hongyang Wei, Hongbo Liu, Zidong Wang, Yi Peng, Baixin Xu, Size Wu, Xuying Zhang, Xianglong He, Zexiang Liu, Peiyu Wang, Xuchen Song, Yangguang Li, Yang Liu, Yahui Zhou
- Abstract summary: We present Skywork UniPic 3.0, a unified framework that integrates single-image editing and multi-image composition. To address the challenges of multi-image composition, we design a comprehensive data collection, filtering, and synthesis pipeline. We introduce a novel training paradigm that formulates multi-image composition as a sequence-modeling problem.
- Score: 21.387568749211876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent surge in popularity of Nano-Banana and Seedream 4.0 underscores the community's strong interest in multi-image composition tasks. Compared to single-image editing, multi-image composition presents significantly greater challenges in terms of consistency and quality, yet existing models have not disclosed specific methodological details for achieving high-quality fusion. Through statistical analysis, we identify Human-Object Interaction (HOI) as the most sought-after category by the community. We therefore systematically analyze and implement a state-of-the-art solution for multi-image composition with a primary focus on HOI-centric tasks. We present Skywork UniPic 3.0, a unified multimodal framework that integrates single-image editing and multi-image composition. Our model supports an arbitrary number (1 to 6) and resolution of input images, as well as arbitrary output resolutions (within a total pixel budget of 1024x1024). To address the challenges of multi-image composition, we design a comprehensive data collection, filtering, and synthesis pipeline, achieving strong performance with only 700K high-quality training samples. Furthermore, we introduce a novel training paradigm that formulates multi-image composition as a sequence-modeling problem, transforming conditional generation into unified sequence synthesis. To accelerate inference, we integrate trajectory mapping and distribution matching into the post-training stage, enabling the model to produce high-fidelity samples in just 8 steps and achieve a 12.5x speedup over standard sampling. Skywork UniPic 3.0 achieves state-of-the-art performance on single-image editing benchmarks and surpasses both Nano-Banana and Seedream 4.0 on a multi-image composition benchmark, thereby validating the effectiveness of our data pipeline and training paradigm. Code, models, and the dataset are publicly available.
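The abstract's key move is recasting multi-image conditional generation as unified sequence synthesis. Below is a minimal, hypothetical PyTorch sketch of that formulation, concatenating patch tokens from a variable number of reference images with the target's tokens into one sequence; every name here (patchify, CompositionTransformer, the separator token) is illustrative, not the released UniPic 3.0 interface.

```python
# Hypothetical sketch of multi-image composition as sequence modeling.
# Names (patchify, CompositionTransformer) are illustrative, not the
# released Skywork UniPic 3.0 API.
import torch
import torch.nn as nn

def patchify(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """(B, C, H, W) -> (B, N, C*patch*patch) flattened patch tokens."""
    B, C, H, W = img.shape
    x = img.unfold(2, patch, patch).unfold(3, patch, patch)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

class CompositionTransformer(nn.Module):
    def __init__(self, dim=256, patch=16, in_ch=3, layers=4, heads=8):
        super().__init__()
        self.patch = patch
        self.proj_in = nn.Linear(in_ch * patch * patch, dim)
        self.sep = nn.Parameter(torch.randn(1, 1, dim))  # image separator token
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.proj_out = nn.Linear(dim, in_ch * patch * patch)

    def forward(self, cond_imgs, noisy_target):
        """cond_imgs: list of (B, C, H, W) references; noisy_target: (B, C, H, W)."""
        parts = []
        for img in cond_imgs:                      # arbitrary number of references
            parts.append(self.proj_in(patchify(img, self.patch)))
            parts.append(self.sep.expand(img.shape[0], -1, -1))
        tgt = self.proj_in(patchify(noisy_target, self.patch))
        seq = torch.cat(parts + [tgt], dim=1)      # one unified sequence
        out = self.backbone(seq)[:, -tgt.shape[1]:]  # read off target positions
        return self.proj_out(out)                  # predicted clean target patches

refs = [torch.randn(1, 3, 64, 64) for _ in range(3)]  # the paper allows 1 to 6
pred = CompositionTransformer()(refs, torch.randn(1, 3, 64, 64))
print(pred.shape)  # torch.Size([1, 16, 768])
```

A model in this shape accepts any number of references at inference simply by lengthening the sequence, which is plausibly how a sequence formulation supports the paper's arbitrary 1 to 6 inputs without task-specific branches.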
Related papers
- UniRef-Image-Edit: Towards Scalable and Consistent Multi-Reference Image Editing [33.64590153603506]
We present UniRef-Image-Edit, a high-performance multi-modal generation system. It unifies single-image editing and multi-image composition within a single framework.
arXiv Detail & Related papers (2026-02-15T15:24:03Z)
- A Unified Framework for Multimodal Image Reconstruction and Synthesis using Denoising Diffusion Models [12.36766048544934]
We introduce Any2all, a unified framework that addresses the limitations of existing methods. We train a single, unconditional diffusion model on the complete multimodal data stack. This model is adapted at inference time to "inpaint" all target modalities from any combination of available clean images or noisy measurements. Our results show that Any2all can achieve excellent performance on both multimodal reconstruction and synthesis tasks.
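Adapting an unconditional diffusion model at inference by "inpainting" typically means masking known modalities back into the sample at each denoising step. A toy sketch of that RePaint-style loop follows; the denoise interface and the variance-exploding schedule are assumptions, not the Any2all procedure.

```python
# Illustrative inpainting-style conditional sampling with an unconditional
# denoiser (RePaint-like masking); NOT the actual Any2all procedure.
import torch

def inpaint_sample(denoise, x_obs, mask, steps=50):
    """denoise(x, sigma) -> x0 estimate (assumed interface).
    x_obs: modalities stacked as channels; mask: 1 where observed."""
    sigmas = torch.linspace(1.0, 0.0, steps + 1)
    x = sigmas[0] * torch.randn_like(x_obs)
    for i in range(steps):
        x0_hat = denoise(x, sigmas[i])                         # unconditional guess
        x = x0_hat + sigmas[i + 1] * torch.randn_like(x)       # step to next noise level
        x_known = x_obs + sigmas[i + 1] * torch.randn_like(x)  # re-noise observations
        x = mask * x_known + (1 - mask) * x                    # paste observed modalities
    return x

x_obs = torch.zeros(1, 2, 32, 32)                 # two stacked modalities, one observed
mask = torch.tensor([1.0, 0.0]).view(1, 2, 1, 1).expand_as(x_obs)
out = inpaint_sample(lambda x, s: torch.zeros_like(x), x_obs, mask)  # stand-in network
print(out.shape)  # torch.Size([1, 2, 32, 32])
```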
arXiv Detail & Related papers (2026-02-09T03:54:24Z)
- PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards [86.1965460124838]
We propose a scalable multi-subject data generation pipeline. We first enable single-subject personalization models to acquire knowledge of multi-image and multi-subject scenarios. To enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards.
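A pairwise subject-consistency reward can be as simple as matched cosine similarities between per-subject embeddings of the references and the generated image. The sketch below is hypothetical; the paper's actual reward design may differ.

```python
# Hypothetical pairwise subject-consistency reward: mean cosine similarity
# between aligned per-subject embeddings (e.g., features of detected subject
# crops) from the references and from the generated image.
import torch
import torch.nn.functional as F

def consistency_reward(gen_embs: torch.Tensor, ref_embs: torch.Tensor) -> torch.Tensor:
    """gen_embs, ref_embs: (K, D), row k = subject k in both tensors."""
    return F.cosine_similarity(gen_embs, ref_embs, dim=-1).mean()

reward = consistency_reward(torch.randn(3, 512), torch.randn(3, 512))
print(float(reward))
```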
arXiv Detail & Related papers (2025-12-01T03:25:49Z)
- FusionFM: All-in-One Multi-Modal Image Fusion with Flow Matching [42.22268167379098]
We formulate image fusion as a direct probabilistic transport from source modalities to the fused image distribution. We employ a task-aware selection function to select the most reliable pseudo-labels for each task. For multi-task scenarios, we integrate elastic weight consolidation and experience replay mechanisms to preserve cross-task performance.
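Flow matching trains a velocity field along paths from a source sample to the target distribution; with straight interpolation paths the regression target is simply the constant displacement. A minimal training-step sketch follows, where the path start, the stacked-modality layout, and the tiny network are all assumptions rather than FusionFM's design.

```python
# Minimal conditional flow-matching training step for source -> fused
# transport along straight paths; the tiny network and path choice are
# assumptions, not FusionFM's design.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(7, 32, 3, padding=1), nn.SiLU(),
                    nn.Conv2d(32, 3, 3, padding=1))  # toy velocity field

def fm_loss(src, fused):
    """src: two stacked source modalities (B, 6, H, W); fused: target (B, 3, H, W)."""
    x0 = src[:, :3]                                # start the path at one source modality
    t = torch.rand(src.shape[0], 1, 1, 1)
    xt = (1 - t) * x0 + t * fused                  # point on the straight path
    v_target = fused - x0                          # constant velocity along that path
    t_map = t.expand(-1, 1, *xt.shape[2:])         # broadcast t as a channel
    v_pred = net(torch.cat([xt, src[:, 3:], t_map], dim=1))
    return ((v_pred - v_target) ** 2).mean()

src, fused = torch.randn(2, 6, 32, 32), torch.randn(2, 3, 32, 32)
fm_loss(src, fused).backward()  # gradient for one training step
```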
arXiv Detail & Related papers (2025-11-17T02:56:48Z)
- Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z)
- A Simple Approach to Unifying Diffusion-based Conditional Generation [63.389616350290595]
We introduce a simple, unified framework to handle diverse conditional generation tasks. Our approach enables versatile capabilities via different inference-time sampling schemes. Our model supports additional capabilities like non-spatially aligned and coarse conditioning.
arXiv Detail & Related papers (2024-10-15T09:41:43Z)
- Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency [41.87857129429512]
We introduce the Adventurer series models, where we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations. This modeling paradigm allows us to process images in a recurrent formulation with linear complexity relative to the sequence length. In detail, we introduce two simple designs that seamlessly integrate image inputs into the causal inference framework.
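The core idea, reading this summary, is to scan patch tokens uni-directionally and read the representation from a token appended at the end of the sequence. The toy version below uses a GRU as a stand-in for the Mamba-style recurrent block; names and sizes are illustrative.

```python
# Toy "image as a causal token sequence" encoder with linear complexity;
# a GRU stands in for the Mamba-style recurrent block.
import torch
import torch.nn as nn

class CausalImageEncoder(nn.Module):
    def __init__(self, patch=16, in_ch=3, dim=256):
        super().__init__()
        self.to_tokens = nn.Conv2d(in_ch, dim, patch, stride=patch)  # patchify
        self.rnn = nn.GRU(dim, dim, batch_first=True)    # O(N) in sequence length
        self.cls = nn.Parameter(torch.randn(1, 1, dim))  # summary token at the end

    def forward(self, img):
        tok = self.to_tokens(img).flatten(2).transpose(1, 2)   # (B, N, dim)
        tok = torch.cat([tok, self.cls.expand(len(img), -1, -1)], 1)
        out, _ = self.rnn(tok)         # uni-directional scan over patch tokens
        return out[:, -1]              # last state has seen the whole image

feat = CausalImageEncoder()(torch.randn(2, 3, 224, 224))
print(feat.shape)  # torch.Size([2, 256])
```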
arXiv Detail & Related papers (2024-10-10T04:14:52Z)
- IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation [70.8833857249951]
IterComp is a novel framework that aggregates composition-aware model preferences from multiple models. We propose an iterative feedback learning method to enhance compositionality in a closed-loop manner. IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation.
arXiv Detail & Related papers (2024-10-09T17:59:13Z)
- MultiDiff: Consistent Novel View Synthesis from a Single Image [60.04215655745264]
MultiDiff is a novel approach for consistent novel view synthesis of scenes from a single RGB image.
Our results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet.
arXiv Detail & Related papers (2024-06-26T17:53:51Z)
- Many-to-many Image Generation with Auto-regressive Diffusion Models [59.5041405824704]
This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images.
We present MIS, a novel large-scale multi-image dataset, containing 12M synthetic multi-image samples, each with 25 interconnected images.
We learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework.
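Image-level autoregression here means each new image is produced by a full diffusion loop conditioned on everything generated so far. A toy sketch with an assumed denoise(x, t, context) interface and a crude variance-exploding schedule:

```python
# Toy image-level autoregression: each new image comes from a full diffusion
# loop conditioned on the series so far; denoise(x, t, context) is an
# assumed interface, not the M2M code.
import torch

def generate_series(denoise, context, n_new=2, steps=20):
    """context: list of (C, H, W) images; returns it extended by n_new images."""
    imgs = list(context)
    for _ in range(n_new):
        x = torch.randn_like(imgs[0])
        for i in range(steps):
            t = 1.0 - i / steps
            x0 = denoise(x, t, torch.stack(imgs))             # condition on history
            x = x0 + (t - 1.0 / steps) * torch.randn_like(x)  # re-noise one level down
        imgs.append(x)                                        # autoregressive step
    return imgs

series = generate_series(lambda x, t, c: torch.zeros_like(x),
                         [torch.randn(3, 16, 16)])            # stand-in denoiser
print(len(series))  # 3
```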
arXiv Detail & Related papers (2024-04-03T23:20:40Z)
- MoMo: A shared encoder Model for text, image and multi-Modal representations [4.812718493682455]
We propose a self-supervised shared encoder model that achieves strong results on several visual, language and multimodal benchmarks.
We use a single transformer with all the encoder layers processing both the text and the image modalities.
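A minimal sketch of that shared-encoder idea: one transformer stack consumes text tokens and image patch tokens as a single sequence, with a type embedding marking the modality. Dimensions and names are illustrative assumptions, not MoMo's actual configuration.

```python
# Rough sketch of a shared text-image encoder: one transformer stack consumes
# text tokens and image patch tokens as a single sequence (dimensions and the
# type-embedding scheme are illustrative, not MoMo's configuration).
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, vocab=1000, dim=256, patch=16, layers=4):
        super().__init__()
        self.txt_emb = nn.Embedding(vocab, dim)
        self.img_emb = nn.Conv2d(3, dim, patch, stride=patch)
        self.type_emb = nn.Embedding(2, dim)     # 0 = text token, 1 = image token
        block = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)  # all layers shared

    def forward(self, ids, img):
        t = self.txt_emb(ids) + self.type_emb.weight[0]
        v = self.img_emb(img).flatten(2).transpose(1, 2) + self.type_emb.weight[1]
        return self.encoder(torch.cat([t, v], dim=1))  # one joint sequence

out = SharedEncoder()(torch.randint(0, 1000, (2, 12)), torch.randn(2, 3, 64, 64))
print(out.shape)  # torch.Size([2, 28, 256])
```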
arXiv Detail & Related papers (2023-04-11T22:26:10Z)