Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos
- URL: http://arxiv.org/abs/2507.02316v1
- Date: Thu, 03 Jul 2025 05:01:46 GMT
- Title: Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos
- Authors: Zecheng Zhao, Selena Song, Tong Chen, Zhi Chen, Shazia Sadiq, Yadan Luo
- Abstract summary: We introduce SynTVA, a new dataset and benchmark designed to evaluate the utility of synthetic videos for building retrieval models. We generate synthetic videos using state-of-the-art T2V models and annotate each video-text pair along four key semantic alignment dimensions. Our evaluation framework correlates general video quality assessment (VQA) metrics with these alignment scores, and examines their predictive power for downstream text-to-video retrieval performance.
- Score: 16.36132851725219
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-video (T2V) synthesis has advanced rapidly, yet current evaluation metrics primarily capture visual quality and temporal consistency, offering limited insight into how synthetic videos perform in downstream tasks such as text-to-video retrieval (TVR). In this work, we introduce SynTVA, a new dataset and benchmark designed to evaluate the utility of synthetic videos for building retrieval models. Based on 800 diverse user queries derived from MSRVTT training split, we generate synthetic videos using state-of-the-art T2V models and annotate each video-text pair along four key semantic alignment dimensions: Object \& Scene, Action, Attribute, and Prompt Fidelity. Our evaluation framework correlates general video quality assessment (VQA) metrics with these alignment scores, and examines their predictive power for downstream TVR performance. To explore pathways of scaling up, we further develop an Auto-Evaluator to estimate alignment quality from existing metrics. Beyond benchmarking, our results show that SynTVA is a valuable asset for dataset augmentation, enabling the selection of high-utility synthetic samples that measurably improve TVR outcomes. Project page and dataset can be found at https://jasoncodemaker.github.io/SynTVA/.
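As a rough illustration of the framework described in the abstract, the sketch below correlates a generic VQA metric with the four SynTVA alignment dimensions and then keeps only high-alignment synthetic pairs for retrieval-training augmentation. The field names, the 1-5 score range, the 4.0 threshold, and the toy data are assumptions for illustration, not the released SynTVA code.

```python
# Minimal sketch (not the authors' released code): correlating a generic VQA
# metric with SynTVA-style alignment annotations, then keeping high-utility
# synthetic clips for retrieval-training augmentation.
from scipy.stats import spearmanr

# Each record: one synthetic video-text pair with a generic quality score
# and the four alignment dimensions annotated in SynTVA (toy values).
samples = [
    {"vqa": 0.81, "object_scene": 5, "action": 4, "attribute": 4, "prompt_fidelity": 5},
    {"vqa": 0.62, "object_scene": 3, "action": 2, "attribute": 3, "prompt_fidelity": 2},
    {"vqa": 0.74, "object_scene": 4, "action": 5, "attribute": 5, "prompt_fidelity": 4},
    {"vqa": 0.58, "object_scene": 4, "action": 3, "attribute": 2, "prompt_fidelity": 3},
]

dimensions = ["object_scene", "action", "attribute", "prompt_fidelity"]

# 1) How predictive is the VQA metric of each alignment dimension?
for dim in dimensions:
    rho, p = spearmanr([s["vqa"] for s in samples], [s[dim] for s in samples])
    print(f"VQA vs {dim}: Spearman rho={rho:.2f} (p={p:.2f})")

# 2) Select high-utility samples: keep pairs whose mean alignment clears a threshold.
def mean_alignment(s):
    return sum(s[d] for d in dimensions) / len(dimensions)

augmentation_pool = [s for s in samples if mean_alignment(s) >= 4.0]
print(f"kept {len(augmentation_pool)} of {len(samples)} synthetic pairs for TVR training")
```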
Related papers
- Pre-training for Action Recognition with Automatically Generated Fractal Datasets [23.686476742398973]
We present methods to automatically produce large-scale datasets of short synthetic video clips.
The generated video clips exhibit notable variety, stemming from the innate ability of fractals to produce complex multi-scale structures.
Compared to standard Kinetics pre-training, the reported results are comparable overall and even superior on a portion of the downstream datasets.
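As a toy illustration of fractal-based clip generation (not the authors' pipeline), the sketch below renders a short grayscale "video" by animating a Julia set; the resolution, frame count, and parameter sweep are arbitrary choices.

```python
# Minimal sketch: escape-time rendering of an animated Julia set, one simple
# way fractals yield label-free, multi-scale synthetic video frames.
import numpy as np

def julia_frame(c, size=128, iters=60):
    """Escape-time rendering of the Julia set for parameter c."""
    y, x = np.mgrid[-1.5:1.5:size * 1j, -1.5:1.5:size * 1j]
    z = x + 1j * y
    count = np.zeros(z.shape, dtype=np.uint8)
    mask = np.ones(z.shape, dtype=bool)
    for _ in range(iters):
        z[mask] = z[mask] ** 2 + c          # iterate only points still bounded
        mask &= np.abs(z) < 2.0             # drop points that escaped
        count[mask] += 1                    # higher counts = slower escape
    return count

# Animate by rotating c along a small circle in the complex plane.
frames = [julia_frame(0.7885 * np.exp(1j * t)) for t in np.linspace(0, 2 * np.pi, 16)]
clip = np.stack(frames)                     # shape: (16, 128, 128), a tiny grayscale "video"
print(clip.shape, clip.dtype)
```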
arXiv Detail & Related papers (2024-11-26T16:51:11Z) - Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification [5.468979600421325]
We introduce NeuS-V, a novel synthetic video evaluation metric. NeuS-V rigorously assesses text-to-video alignment using neuro-symbolic formal verification techniques. We find that NeuS-V achieves over 5x higher correlation with human evaluations than existing metrics.
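A minimal sketch of the general idea, assuming per-frame event detectors and a simple temporal-ordering constraint; the events, confidences, and threshold below are hypothetical, and this is not the NeuS-V formal-verification pipeline.

```python
# Minimal sketch: score a video against a prompt by checking that its events
# occur in the prompted order, using hypothetical per-frame detector outputs.
def holds(confidences, threshold=0.5):
    """First frame index where an event's detector fires, or None."""
    return next((i for i, c in enumerate(confidences) if c >= threshold), None)

def sequenced(video_confidences, events):
    """Check 'event_1 eventually happens, then event_2, ...' in order."""
    last = -1
    for e in events:
        t = holds(video_confidences[e])
        if t is None or t <= last:
            return False
        last = t
    return True

# "a dog picks up a ball, then runs" -> two ordered events (toy confidences)
frames = {"dog_picks_ball": [0.1, 0.7, 0.8], "dog_runs": [0.0, 0.2, 0.9]}
print(sequenced(frames, ["dog_picks_ball", "dog_runs"]))   # True
```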
arXiv Detail & Related papers (2024-11-22T23:59:12Z) - xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of the visual token sequence.
The DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
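A back-of-the-envelope sketch of why spatio-temporal compression matters for a DiT-style T2V model; the 8x spatial ratio, 4x temporal ratio, and patch size are assumed values, not figures reported for xGen-VideoSyn-1.

```python
# Minimal sketch: how a VAE-style downsample plus patchification shrinks the
# visual token count that a DiT backbone has to attend over.
def visual_token_count(frames, height, width,
                       spatial_ratio=8, temporal_ratio=4, patch=2):
    """Tokens after spatio-temporal downsampling followed by patchification."""
    lat_t = frames // temporal_ratio
    lat_h = height // spatial_ratio
    lat_w = width // spatial_ratio
    return lat_t * (lat_h // patch) * (lat_w // patch)

raw = 64 * (512 // 2) * (512 // 2)              # 2x2 patches per frame, no VAE
compressed = visual_token_count(64, 512, 512)   # with the assumed compression ratios
print(f"{raw:,} -> {compressed:,} tokens ({raw / compressed:.0f}x fewer)")
```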
arXiv Detail & Related papers (2024-08-22T17:55:22Z) - T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation [55.57459883629706]
We conduct the first systematic study on compositional text-to-video generation. We propose T2V-CompBench, the first benchmark tailored for compositional text-to-video generation.
arXiv Detail & Related papers (2024-07-19T17:58:36Z) - Towards A Better Metric for Text-to-Video Generation [102.16250512265995]
Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos.
We introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore).
This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts.
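A minimal sketch of combining the two criteria into one number, assuming both are already normalised to 0-1; the equal weighting is an illustrative choice, not the released T2VScore implementation.

```python
# Minimal sketch: blend a text-video alignment score with a video-quality score.
def t2v_style_score(text_video_alignment, video_quality, w_align=0.5):
    """Weighted blend of the two criteria described in the paper."""
    assert 0.0 <= text_video_alignment <= 1.0 and 0.0 <= video_quality <= 1.0
    return w_align * text_video_alignment + (1.0 - w_align) * video_quality

print(t2v_style_score(0.72, 0.88))   # -> 0.80
```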
arXiv Detail & Related papers (2024-01-15T15:42:39Z) - AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI [1.1035305628305816]
This paper introduces AIGCBench, a pioneering comprehensive benchmark designed to evaluate a variety of video generation tasks.
It includes a varied, open-domain image-text dataset for evaluating different state-of-the-art algorithms under equivalent conditions.
We employ a novel text combiner and GPT-4 to create rich text prompts, which are then used to generate images via advanced Text-to-Image models.
arXiv Detail & Related papers (2024-01-03T10:08:40Z) - EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge large conditional generative models using simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
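A minimal sketch of rolling many per-video metrics up into the four evaluated aspects, assuming each metric is already normalised to 0-1; the metric names and grouping are illustrative, not EvalCrafter's exact recipe.

```python
# Minimal sketch: average grouped objective metrics into per-aspect scores.
ASPECTS = {
    "visual_quality":  ["aesthetic_score", "technical_score"],
    "content_quality": ["object_consistency", "scene_consistency"],
    "motion_quality":  ["flow_smoothness", "warping_error_inv"],
    "tv_alignment":    ["clip_score", "caption_similarity"],
}

def aspect_scores(metrics):
    """Average each aspect's metrics (already normalised to 0-1) into one score."""
    return {aspect: sum(metrics[m] for m in names) / len(names)
            for aspect, names in ASPECTS.items()}

example = {"aesthetic_score": 0.8, "technical_score": 0.7,
           "object_consistency": 0.9, "scene_consistency": 0.85,
           "flow_smoothness": 0.6, "warping_error_inv": 0.75,
           "clip_score": 0.65, "caption_similarity": 0.7}
print(aspect_scores(example))
```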
arXiv Detail & Related papers (2023-10-17T17:50:46Z) - DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z) - Video compression dataset and benchmark of learning-based video-quality metrics [55.41644538483948]
We present a new benchmark for video-quality metrics that evaluates video compression.
It is based on a new dataset consisting of about 2,500 streams encoded using different standards.
Subjective scores were collected using crowdsourced pairwise comparisons.
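A minimal sketch of turning crowdsourced pairwise preferences into per-stream subjective scores with a simple Bradley-Terry fit; the vote counts below are made up, and the benchmark's actual scoring procedure may differ.

```python
# Minimal sketch: Bradley-Terry scores from a pairwise-preference win matrix.
import numpy as np

# wins[i][j] = number of times stream i was preferred over stream j (toy data)
wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]], dtype=float)

def bradley_terry(wins, iters=200):
    """Minorization-maximization fit of Bradley-Terry strengths."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            num = wins[i].sum()                       # total wins of stream i
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                      for j in range(n) if j != i)    # total comparisons, weighted
            p[i] = num / den
        p /= p.sum()                                  # fix the scale
    return p

print(bradley_terry(wins))   # relative quality scores, higher = preferred
```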
arXiv Detail & Related papers (2022-11-22T09:22:28Z)