Related papers: Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model

URL: http://arxiv.org/abs/2510.18573v1
Date: Tue, 21 Oct 2025 12:28:14 GMT
Title: Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model
Authors: Zhenxing Zhang, Jiayan Teng, Zhuoyi Yang, Tiankun Cao, Cheng Wang, Xiaotao Gu, Jie Tang, Dan Guo, Meng Wang
Abstract summary: We present Kaleido, a subject-to-video (S2V) generation framework, which aims to synthesize subject-consistent videos conditioned on multiple reference images of target subjects. Kaleido significantly outperforms previous methods in consistency, fidelity, and generalization, marking an advance in S2V generation.
Score: 38.79676648965641
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present Kaleido, a subject-to-video (S2V) generation framework, which aims to synthesize subject-consistent videos conditioned on multiple reference images of target subjects. Despite recent progress in S2V generation models, existing approaches remain inadequate at maintaining multi-subject consistency and at handling background disentanglement, often resulting in lower reference fidelity and semantic drift under multi-image conditioning. These shortcomings can be attributed to several factors. Primarily, the training dataset suffers from a lack of diversity and high-quality samples, as well as cross-paired data, i.e., paired samples whose components originate from different instances. In addition, the current mechanism for integrating multiple reference images is suboptimal, potentially resulting in the confusion of multiple subjects. To overcome these limitations, we propose a dedicated data construction pipeline, incorporating low-quality sample filtering and diverse data synthesis, to produce consistency-preserving training data. Moreover, we introduce Reference Rotary Positional Encoding (R-RoPE) to process reference images, enabling stable and precise multi-image integration. Extensive experiments across numerous benchmarks demonstrate that Kaleido significantly outperforms previous methods in consistency, fidelity, and generalization, marking an advance in S2V generation.

Related papers

PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards [86.1965460124838]
We propose a scalable multi-subject data generation pipeline. We first enable single-subject personalization models to acquire knowledge of multi-image and multi-subject scenarios. To enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards.
arXiv Detail & Related papers (2025-12-01T03:25:49Z)
CountLoop: Training-Free High-Instance Image Generation via Iterative Agent Guidance [47.59187786346473]
We present CountLoop, a training-free framework that provides diffusion models with accurate instance control. Experiments on COCO Count, T2I CompBench, and two new high-instance benchmarks show that CountLoop achieves counting accuracy of up to 98%.
arXiv Detail & Related papers (2025-08-18T11:28:02Z)
Less-to-More Generalization: Unlocking More Controllability by In-Context Generation [4.832184187988317]
We propose a highly consistent data synthesis pipeline to tackle subject-driven generation challenges. This pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers and generates high-consistency multi-subject paired data. We also introduce UNO, which consists of progressive cross-modal alignment and universal rotary position embedding.
arXiv Detail & Related papers (2025-04-02T22:20:21Z)
D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens [80.75893450536577]
We propose D2C, a novel two-stage method to enhance model generation capacity. In the first stage, the discrete-valued tokens representing coarse-grained image features are sampled by employing a small discrete-valued generator. In the second stage, the continuous-valued tokens representing fine-grained image features are learned conditioned on the discrete token sequence.
arXiv Detail & Related papers (2025-03-21T13:58:49Z)
Multi-subject Open-set Personalization in Video Generation [110.02124633005516]
We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt. Our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2025-01-10T18:59:54Z)
MultiDiff: Consistent Novel View Synthesis from a Single Image [60.04215655745264]
MultiDiff is a novel approach for consistent novel view synthesis of scenes from a single RGB image. Our results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet.
arXiv Detail & Related papers (2024-06-26T17:53:51Z)
Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is through augmenting the training set with synthetic images generated by T2I models. In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques. We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
A Closer Look at Few-shot Image Generation [38.83570296616384]
When pretrained GANs are adapted to small target datasets, the generator tends to replicate the training samples. Several methods have been proposed to address this few-shot image generation problem, but little effort has been made to analyze them under a unified framework. We propose a framework to analyze existing methods during adaptation. As a second contribution, we propose applying mutual information (MI) to retain the source domain's rich multi-level diversity information in the target domain generator.
arXiv Detail & Related papers (2022-05-08T07:46:26Z)
Generating Annotated High-Fidelity Images Containing Multiple Coherent Objects [10.783993190686132]
We propose a multi-object generation framework that can synthesize images with multiple objects without explicitly requiring contextual information. We demonstrate how coherency and fidelity are preserved with our method through experiments on the Multi-MNIST and CLEVR datasets.
arXiv Detail & Related papers (2020-06-22T11:33:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.