ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
- URL: http://arxiv.org/abs/2406.18522v2
- Date: Tue, 01 Oct 2024 20:00:27 GMT
- Title: ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
- Authors: Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, Li Yuan
- Abstract summary: ChronoMagic-Bench is a text-to-video (T2V) generation benchmark.
It focuses on the model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence.
We conduct manual evaluations of ten representative T2V models, revealing their strengths and weaknesses.
We create a large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos and detailed captions.
- Score: 57.651809298512276
- Abstract: We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of T2V models (e.g., Sora and Lumiere) in time-lapse video generation. In contrast to existing benchmarks that focus on visual quality and textual relevance of generated videos, ChronoMagic-Bench focuses on the model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence. The benchmark probes T2V models for their physics, biology, and chemistry capabilities using free-form text queries. For these purposes, ChronoMagic-Bench introduces 1,649 prompts and real-world videos as references, categorized into four major types of time-lapse videos: biological, human-created, meteorological, and physical phenomena, which are further divided into 75 subcategories. This categorization comprehensively evaluates the model's capacity to handle diverse and complex transformations. To accurately align human preference with the benchmark, we introduce two new automatic metrics, MTScore and CHScore, to evaluate the videos' metamorphic attributes and temporal coherence. MTScore measures the metamorphic amplitude, reflecting the degree of change over time, while CHScore assesses the temporal coherence, ensuring the generated videos maintain logical progression and continuity. Based on ChronoMagic-Bench, we conduct comprehensive manual evaluations of ten representative T2V models, revealing their strengths and weaknesses across different categories of prompts, and providing a thorough evaluation framework that addresses current gaps in video generation research. Moreover, we create a large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos and detailed captions, ensuring high physical pertinence and large metamorphic amplitude. [Homepage](https://pku-yuangroup.github.io/ChronoMagic-Bench/).
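The abstract does not spell out how MTScore and CHScore are computed, so the following is only a rough sketch of how a ChronoMagic-Bench-style evaluation loop could be organized. The `coherence_proxy` and `metamorphic_proxy` functions and the `model.generate` call are illustrative stand-ins, not the paper's metrics or API.

```python
import numpy as np

def coherence_proxy(frames: np.ndarray) -> float:
    """Crude temporal-coherence stand-in (NOT the paper's CHScore):
    mean cosine similarity between consecutive frames."""
    f = frames.reshape(len(frames), -1).astype(np.float64)
    f /= np.linalg.norm(f, axis=1, keepdims=True) + 1e-8
    return float(np.mean(np.sum(f[:-1] * f[1:], axis=1)))

def metamorphic_proxy(frames: np.ndarray) -> float:
    """Crude metamorphic-amplitude stand-in (NOT the paper's MTScore):
    relative change between the first and last frame."""
    a = frames[0].astype(np.float64).ravel()
    b = frames[-1].astype(np.float64).ravel()
    return float(np.linalg.norm(b - a) / (np.linalg.norm(a) + 1e-8))

# The four major ChronoMagic-Bench categories (the 75 subcategories are omitted here).
CATEGORIES = ["biological", "human-created", "meteorological", "physical"]

def evaluate(model, prompts_by_category):
    """Hypothetical evaluation loop: generate one video per prompt and average the proxies."""
    scores = {}
    for cat in CATEGORIES:
        ch, mt = [], []
        for prompt in prompts_by_category[cat]:
            frames = model.generate(prompt)  # assumed API returning a (T, H, W, C) array
            ch.append(coherence_proxy(frames))
            mt.append(metamorphic_proxy(frames))
        scores[cat] = {"coherence": float(np.mean(ch)), "metamorphic": float(np.mean(mt))}
    return scores
```

The proxies above only convey the distinction the paper draws between frame-to-frame continuity (CHScore) and the overall amplitude of change over time (MTScore); the actual metrics are automatic measures aligned with human preference.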
Related papers
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z) - xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
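The summary mentions a video VAE that compresses latents spatially and temporally and a DiT with spatial and temporal self-attention. The released xGen-VideoSyn-1 code is not reproduced here; the PyTorch sketch below only illustrates a factorized spatial-then-temporal attention block over VAE latent tokens, with all class and parameter names chosen for illustration.

```python
import torch
import torch.nn as nn

class FactorizedSpatioTemporalBlock(nn.Module):
    """Sketch of factorized attention over compressed video latents (illustrative only):
    spatial self-attention within each frame, then temporal self-attention across frames
    at each spatial position."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens_per_frame, dim) latent tokens from the video VAE
        b, t, n, d = x.shape
        s = x.reshape(b * t, n, d)                      # attend within each frame
        h = self.norm_s(s)
        s = s + self.spatial_attn(h, h, h, need_weights=False)[0]
        x = s.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
        h = self.norm_t(x)
        x = x + self.temporal_attn(h, h, h, need_weights=False)[0]  # attend across frames
        return x.reshape(b, n, t, d).permute(0, 2, 1, 3)  # back to (batch, frames, tokens, dim)
```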
arXiv Detail & Related papers (2024-08-22T17:55:22Z) - FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance [47.88160253507823]
We introduce FancyVideo, an innovative video generator that improves the existing text-control mechanism.
Its Cross-frame Textual Guidance Module (CTGM) incorporates the Temporal Information Injector (TII), Temporal Affinity Refiner (TAR), and Temporal Feature Booster (TFB) at the beginning, middle, and end of cross-attention, respectively.
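As a rough illustration of that beginning/middle/end placement only: the internals of TII, TAR, and TFB are not described here, so simple 1-D temporal convolutions stand in for them in the sketch below. This is not FancyVideo's implementation, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class TemporalGuidedCrossAttention(nn.Module):
    """Schematic only: temporal stages placed before, within, and after text cross-attention,
    mirroring the TII / TAR / TFB placement described for CTGM. Conv1d layers are stand-ins."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.pre_temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)   # "injector" stand-in
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mid_temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)   # "refiner" stand-in
        self.post_temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)  # "booster" stand-in

    def forward(self, frame_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, frames, dim); text_tokens: (batch, text_len, dim)
        x = frame_tokens + self.pre_temporal(frame_tokens.transpose(1, 2)).transpose(1, 2)
        attn_out, _ = self.cross_attn(x, text_tokens, text_tokens)  # text conditioning
        attn_out = attn_out + self.mid_temporal(attn_out.transpose(1, 2)).transpose(1, 2)
        x = x + attn_out
        return x + self.post_temporal(x.transpose(1, 2)).transpose(1, 2)
```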
arXiv Detail & Related papers (2024-08-15T14:47:44Z) - TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation [97.96178992465511]
We argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world videos as time progresses.
To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics.
arXiv Detail & Related papers (2024-06-12T21:41:32Z) - MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators [56.01348424615965]
We propose MagicTime, a time-lapse video generation model, which learns real-world physics knowledge from time-lapse videos and implements metamorphic generation.
Experiments demonstrate the superiority and effectiveness of MagicTime for generating high-quality and dynamic metamorphic videos.
arXiv Detail & Related papers (2024-04-07T16:49:07Z) - Real-time Online Video Detection with Temporal Smoothing Transformers [4.545986838009774]
A good streaming recognition model captures both the long-term dynamics and the short-term changes of a video, but attending over an ever-growing frame history is computationally expensive.
To address this issue, we reformulate the cross-attention in a video transformer through the lens of kernels.
We build TeSTra, a Temporal Smoothing Transformer, that takes in arbitrarily long inputs with constant caching and computing overhead.
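TeSTra's constant-cost property comes from temporal smoothing kernels that let attention over an ever-growing history be maintained recursively. The minimal sketch below shows only the exponential-decay recursion that keeps per-frame cost constant; it is not the paper's full kernelized cross-attention, and all names are illustrative.

```python
import numpy as np

class ExpSmoothingStream:
    """Minimal sketch of the exponential-smoothing idea behind constant-cost streaming
    attention (not TeSTra's actual module). Each incoming frame feature updates a running
    numerator/denominator, so memory and compute per step stay constant."""
    def __init__(self, dim: int, decay: float = 0.9):
        self.decay = decay
        self.numerator = np.zeros(dim)
        self.denominator = 0.0

    def update(self, feature: np.ndarray, weight: float = 1.0) -> np.ndarray:
        # Exponentially down-weight the history, then fold in the new frame.
        self.numerator = self.decay * self.numerator + weight * feature
        self.denominator = self.decay * self.denominator + weight
        return self.numerator / max(self.denominator, 1e-8)  # smoothed long-term context

# Usage: feed per-frame features as they arrive from a video stream.
stream = ExpSmoothingStream(dim=512)
for _ in range(1000):
    frame_feat = np.random.randn(512)  # stand-in for an encoder feature
    context = stream.update(frame_feat)
```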
arXiv Detail & Related papers (2022-09-19T17:59:02Z) - Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer [66.56167074658697]
We present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames.
Our evaluation shows that our model trained on 16-frame video clips can generate diverse, coherent, and high-quality long videos.
We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio.
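The thousands-of-frames claim rests on rolling the transformer forward beyond its 16-frame training window. The sketch below shows one plausible sliding-window sampler over VQGAN codes; `transformer.sample_clip` and `vqgan.decode` are assumed interfaces for illustration, not the authors' released API.

```python
import torch

@torch.no_grad()
def generate_long_video(transformer, vqgan, num_frames: int, clip_len: int = 16):
    """Illustrative sliding-window rollout (not the authors' sampler): the transformer is
    trained on short clips, and long videos are produced by repeatedly conditioning on the
    most recent frames' discrete codes."""
    tokens = transformer.sample_clip(prefix=None, length=clip_len)   # assumed: (clip_len, H', W') codes
    video_tokens = [tokens]
    while sum(t.shape[0] for t in video_tokens) < num_frames:
        prefix = torch.cat(video_tokens, dim=0)[-(clip_len // 2):]   # keep last half-clip as context
        new_tokens = transformer.sample_clip(prefix=prefix, length=clip_len // 2)
        video_tokens.append(new_tokens)
    codes = torch.cat(video_tokens, dim=0)[:num_frames]
    return vqgan.decode(codes)                                       # assumed: codes -> pixel frames
```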
arXiv Detail & Related papers (2022-04-07T17:59:02Z) - Leveraging Local Temporal Information for Multimodal Scene Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention, which are designed to produce contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames.
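A minimal sketch of the general idea follows: combine a windowed (local) and an unrestricted (global) temporal self-attention over per-frame features, then fuse the two views. It illustrates the concept rather than the paper's exact block, and all module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class LocalGlobalTemporalAttention(nn.Module):
    """Illustrative block (not the paper's design): each frame feature attends to a local
    temporal window and, separately, to all frames; the two views are then fused."""
    def __init__(self, dim: int, heads: int = 4, window: int = 5):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) per-frame features from a visual or audio backbone
        t = x.shape[1]
        idx = torch.arange(t, device=x.device)
        blocked = (idx[None, :] - idx[:, None]).abs() > self.window  # True = cannot attend
        local_out, _ = self.local_attn(x, x, x, attn_mask=blocked)
        global_out, _ = self.global_attn(x, x, x)
        return self.fuse(torch.cat([local_out, global_out], dim=-1))
```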
arXiv Detail & Related papers (2021-10-26T19:58:32Z)