OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation
- URL: http://arxiv.org/abs/2505.20292v4
- Date: Tue, 03 Jun 2025 10:11:00 GMT
- Title: OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation
- Authors: Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, Li Yuan
- Abstract summary: We propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. We create the first open-source large-scale S2V generation dataset, OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in video production. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench, which focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. For these purposes, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, incorporating both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore, and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 18 representative S2V models, highlighting their strengths and weaknesses across different content categories. Moreover, we create the first open-source large-scale S2V generation dataset, OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.
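To make the evaluation protocol concrete, below is a minimal Python sketch of how per-video scores from the three automatic metrics (NexusScore for subject consistency, NaturalScore for naturalness, GmeScore for text relevance) might be aggregated into a single leaderboard value. The data class, field names, score ranges, and weights are illustrative assumptions; the abstract does not specify how, or whether, the metrics are combined.

```python
# Hypothetical aggregation of OpenS2V-Eval metric scores.
# Names, ranges, and weights below are assumptions for illustration,
# not the paper's released implementation.
from dataclasses import dataclass


@dataclass
class S2VScores:
    nexus: float    # NexusScore: subject consistency, assumed in [0, 1]
    natural: float  # NaturalScore: subject naturalness, assumed in [0, 1]
    gme: float      # GmeScore: text relevance, assumed in [0, 1]


def overall_score(
    s: S2VScores,
    weights: tuple[float, float, float] = (0.4, 0.3, 0.3),  # assumed weights
) -> float:
    """Weighted mean of the three per-video scores."""
    w_nexus, w_natural, w_gme = weights
    return w_nexus * s.nexus + w_natural * s.natural + w_gme * s.gme


if __name__ == "__main__":
    demo = S2VScores(nexus=0.82, natural=0.74, gme=0.91)
    print(f"overall: {overall_score(demo):.3f}")  # overall: 0.823
```

In practice, such per-video scores would be averaged over all 180 evaluation prompts to rank the 18 models on the benchmark.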
Related papers
- CI-VID: A Coherent Interleaved Text-Video Dataset [23.93099552431937]
CI-VID is a dataset that moves beyond isolated text-to-video (T2V) generation toward text-and-video-to-video (TV2V) generation. It contains over 340,000 samples, each featuring a coherent sequence of video clips with text captions. We show that models trained on CI-VID exhibit significant improvements in both accuracy and content consistency when generating video sequences.
arXiv Detail & Related papers (2025-07-02T17:48:01Z)
- LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation [46.994391428519776]
We present AIGVE-60K, a comprehensive dataset and benchmark for AI-Generated Video Evaluation. We propose LOVE, an LMM-based metric for AIGV evaluation across multiple dimensions, including perceptual preference, text-video correspondence, and task-specific accuracy at both the instance and model levels.
arXiv Detail & Related papers (2025-05-17T17:49:26Z)
- T2VEval: Benchmark Dataset and Objective Evaluation Method for T2V-generated Videos [9.742383920787413]
T2VEval is a multi-branch fusion scheme for text-to-video quality evaluation. It assesses videos across three branches: text-video consistency, realness, and technical quality. T2VEval achieves state-of-the-art performance across multiple metrics.
arXiv Detail & Related papers (2025-01-15T03:11:33Z)
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
Its VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
Its DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
arXiv Detail & Related papers (2024-08-22T17:55:22Z)
- OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation [33.62365864717086]
We introduce OpenVid-1M, a precise, high-quality dataset with expressive captions. We also curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M, advancing high-definition video generation.
arXiv Detail & Related papers (2024-07-02T15:40:29Z)
- V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning [76.26890864487933]
Video summarization aims to create short, accurate, and cohesive summaries of longer videos.
Most existing datasets are created for video-to-video summarization.
Recent efforts have been made to expand from unimodal to multimodal video summarization.
arXiv Detail & Related papers (2024-04-18T17:32:46Z)
- Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA).
arXiv Detail & Related papers (2024-03-18T16:52:49Z)
- I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models [54.99771394322512]
Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models.
It still encounters challenges in terms of semantic accuracy, clarity, and spatio-temporal continuity.
We propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors.
I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos.
arXiv Detail & Related papers (2023-11-07T17:16:06Z)