Phantom-Data : Towards a General Subject-Consistent Video Generation Dataset
- URL: http://arxiv.org/abs/2506.18851v1
- Date: Mon, 23 Jun 2025 17:11:56 GMT
- Title: Phantom-Data : Towards a General Subject-Consistent Video Generation Dataset
- Authors: Zhuowei Chen, Bingchuan Li, Tianxiang Ma, Lijie Liu, Mingcong Liu, Yi Zhang, Gen Li, Xinghui Li, Siyu Zhou, Qian He, Xinglong Wu
- Abstract summary: We introduce Phantom-Data, the first general-purpose cross-pair subject-to-video consistency dataset. Our dataset is constructed via a three-stage pipeline: (1) a general and input-aligned subject detection module, (2) large-scale cross-context subject retrieval from more than 53 million videos and 3 billion images, and (3) prior-guided identity verification to ensure visual consistency under contextual variation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Subject-to-video generation has witnessed substantial progress in recent years. However, existing models still face significant challenges in faithfully following textual instructions. This limitation, commonly known as the copy-paste problem, arises from the widely used in-pair training paradigm. This approach inherently entangles subject identity with background and contextual attributes by sampling reference images from the same scene as the target video. To address this issue, we introduce Phantom-Data, the first general-purpose cross-pair subject-to-video consistency dataset, containing approximately one million identity-consistent pairs across diverse categories. Our dataset is constructed via a three-stage pipeline: (1) a general and input-aligned subject detection module, (2) large-scale cross-context subject retrieval from more than 53 million videos and 3 billion images, and (3) prior-guided identity verification to ensure visual consistency under contextual variation. Comprehensive experiments show that training with Phantom-Data significantly improves prompt alignment and visual quality while preserving identity consistency on par with in-pair baselines.
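The three-stage pipeline in the abstract (subject detection, cross-context retrieval, identity verification) can be sketched in Python. All names here (`detect_subjects`, `retrieve_cross_context`, `verify_identity`, the `Candidate` record, and the similarity threshold) are illustrative assumptions, not the authors' released implementation; the point is only the data flow that yields identity-consistent pairs drawn from *different* contexts.

```python
# Hypothetical sketch of a cross-pair dataset construction pipeline.
# All function names, fields, and thresholds are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Candidate:
    subject_id: str   # subject identity label
    context: str      # scene/background descriptor
    features: tuple   # stand-in for a visual embedding


def detect_subjects(video):
    """Stage 1: general, input-aligned subject detection (stubbed)."""
    return video["subjects"]


def retrieve_cross_context(subject, gallery):
    """Stage 2: retrieve the same subject from *different* contexts,
    so reference and target never share a scene."""
    return [c for c in gallery
            if c.subject_id == subject.subject_id
            and c.context != subject.context]


def similarity(a, b):
    """Toy embedding similarity: fraction of matching feature dims."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def verify_identity(subject, candidates, threshold=0.5):
    """Stage 3: keep only candidates whose visual identity survives
    the context change (prior-guided verification, stubbed)."""
    return [c for c in candidates
            if similarity(subject.features, c.features) >= threshold]


def build_cross_pairs(videos, gallery):
    """Chain the three stages into identity-consistent cross-context pairs."""
    pairs = []
    for video in videos:
        for subj in detect_subjects(video):
            refs = retrieve_cross_context(subj, gallery)
            for ref in verify_identity(subj, refs):
                pairs.append((subj, ref))
    return pairs
```

The key design point the sketch captures is that stage 2 explicitly excludes same-context candidates, which is what breaks the identity/background entanglement behind the copy-paste problem.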
Related papers
- ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation [14.141157176094737]
Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions. Existing I2V pipelines often suffer from appearance drift and geometric distortion. We propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views.
arXiv Detail & Related papers (2026-02-10T18:59:51Z) - Your One-Stop Solution for AI-Generated Video Detection [26.581301251283943]
Generative modeling can create remarkably realistic synthetic videos. However, two key limitations hinder the development of this field. We propose AIGVDBench, a benchmark designed to be comprehensive and representative.
arXiv Detail & Related papers (2026-01-16T07:02:06Z) - OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation [53.33087515226418]
We introduce OpenSubject, a video-derived large-scale corpus with 2.5M samples and 4.35M images for subject-driven generation and manipulation. The dataset is built with a four-stage pipeline that exploits cross-frame identity priors. In addition, we introduce a benchmark covering subject-driven generation and manipulation, and then evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency with a VLM judge.
arXiv Detail & Related papers (2025-12-09T06:49:33Z) - OmniPerson: Unified Identity-Preserving Pedestrian Generation [12.060261814704022]
We introduce OmniPerson, the first unified identity-preserving pedestrian generation pipeline for ReID tasks. We present PersonSyn, the first large-scale dataset for multi-reference, controllable pedestrian generation. We will open-source the full pretrained model and the PersonSyn dataset.
arXiv Detail & Related papers (2025-12-02T09:24:34Z) - PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards [86.1965460124838]
We propose a scalable multi-subject data generation pipeline. We first enable single-subject personalization models to acquire knowledge of multi-image and multi-subject scenarios. To enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards.
arXiv Detail & Related papers (2025-12-01T03:25:49Z) - From Large Angles to Consistent Faces: Identity-Preserving Video Generation via Mixture of Facial Experts [69.44297222099175]
We introduce a Mixture of Facial Experts (MoFE) that captures distinct but mutually reinforcing aspects of facial attributes. To mitigate dataset limitations, we have tailored a data processing pipeline centered on two key aspects: Face Constraints and Identity Consistency. We have curated and refined a Large Face Angles (LFA) dataset from existing open-source human video datasets.
arXiv Detail & Related papers (2025-08-13T04:10:16Z) - Multimodal Referring Segmentation: A Survey [93.24051010753817]
Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions in text or audio format. Over the past decade, it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models.
arXiv Detail & Related papers (2025-08-01T02:14:00Z) - Get In Video: Add Anything You Want to the Video [48.06070610416688]
Video editing increasingly demands the ability to incorporate specific real-world instances into existing footage. Current approaches fail to capture the unique visual characteristics of particular subjects and ensure natural instance/scene interactions. We introduce "Get-In-Video Editing", where users provide reference images to precisely specify visual elements they wish to incorporate into videos.
arXiv Detail & Related papers (2025-03-08T16:27:53Z) - Synthesizing Consistent Novel Views via 3D Epipolar Attention without Re-Training [102.82553402539139]
Large diffusion models demonstrate remarkable zero-shot capabilities in novel view synthesis from a single image. However, these models often face challenges in maintaining consistency across novel and reference views. We propose to use epipolar geometry to locate and retrieve overlapping information from the input view. This information is then incorporated into the generation of target views, eliminating the need for training or fine-tuning.
arXiv Detail & Related papers (2025-02-25T14:04:22Z) - Phantom: Subject-consistent video generation via cross-modal alignment [16.777805813950486]
We propose a unified video generation framework for both single- and multi-subject references. The proposed method achieves high-fidelity subject-consistent video generation while addressing issues of image content leakage and multi-subject confusion.
arXiv Detail & Related papers (2025-02-16T11:02:50Z) - Multi-subject Open-set Personalization in Video Generation [110.02124633005516]
We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt. Our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2025-01-10T18:59:54Z) - Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags [28.368960723666458]
Multimodal Large Language Models (MLLMs) struggle with critical problems when required to provide a precise and detailed response to a visual instruction. Prior approaches show effectiveness in mitigating these issues, but at the expensive cost of collecting vast amounts of new data. We propose to enhance the mapping with retrieval-augmented tag tokens, which contain rich object-aware information.
arXiv Detail & Related papers (2024-06-16T08:20:12Z) - CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects [61.323597069037056]
Current approaches for personalizing text-to-video generation suffer from tackling multiple subjects.
We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects.
arXiv Detail & Related papers (2024-01-18T13:23:51Z) - A Data Perspective on Enhanced Identity Preservation for Diffusion Personalization [25.56082131075747]
Large text-to-image models have revolutionized the ability to generate imagery using natural language.
This has led to interest in how to personalize a text-to-image model.
We introduce a novel regularization dataset generation strategy on both the text and image level.
arXiv Detail & Related papers (2023-11-07T19:41:19Z) - Weakly-supervised 3D Pose Transfer with Keypoints [57.66991032263699]
The main challenges of 3D pose transfer are: 1) lack of paired training data with different characters performing the same pose; 2) disentangling pose and shape information from the target mesh; 3) difficulty in applying to meshes with different topologies.
We propose a novel weakly-supervised keypoint-based framework to overcome these difficulties.
arXiv Detail & Related papers (2023-07-25T12:40:24Z) - JourneyDB: A Benchmark for Generative Image Understanding [89.02046606392382]
We introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images.
Our meticulously curated dataset comprises 4 million distinct and high-quality generated images.
On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension.
arXiv Detail & Related papers (2023-07-03T02:39:08Z) - Reference-Aided Part-Aligned Feature Disentangling for Video Person Re-Identification [18.13546384207381]
We propose a Reference-Aided Part-Aligned (RAPA) framework to disentangle robust features of different parts. By using both modules, the informative parts of pedestrians in videos are well aligned and a more discriminative feature representation is generated.
arXiv Detail & Related papers (2021-03-21T06:53:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.