GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts
- URL: http://arxiv.org/abs/2509.08818v1
- Date: Wed, 10 Sep 2025 17:51:42 GMT
- Title: GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts
- Authors: Jenna Kang, Maria Silva, Patsorn Sangkloy, Kenneth Chen, Niall Williams, Qi Sun
- Abstract summary: GeneVA is a large-scale dataset with rich human annotations that focuses on spatio-temporal artifacts in videos generated from natural text prompts. We hope GeneVA can enable and assist critical applications, such as benchmarking model performance and improving generative video quality.
- Score: 7.080098051830916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in probabilistic generative models have extended capabilities from static image synthesis to text-driven video generation. However, the inherent randomness of their generation process can lead to unpredictable artifacts, such as impossible physics and temporal inconsistency. Progress in addressing these challenges requires systematic benchmarks, yet existing datasets primarily focus on generative images due to the unique spatio-temporal complexities of videos. To bridge this gap, we introduce GeneVA, a large-scale artifact dataset with rich human annotations that focuses on spatio-temporal artifacts in videos generated from natural text prompts. We hope GeneVA can enable and assist critical applications, such as benchmarking model performance and improving generative video quality.
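The abstract does not specify GeneVA's release format. As a concrete illustration only, here is a minimal Python sketch of what a single human annotation record for a spatio-temporal artifact might look like; every field name (clip_id, artifact_type, frame range, bounding box, annotator_id) is a hypothetical assumption, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical schema for one GeneVA-style annotation record. The abstract
# does not specify the dataset's format, so every field below is an
# assumption made for illustration only.
@dataclass
class ArtifactAnnotation:
    clip_id: str          # which generated video the label belongs to
    prompt: str           # natural text prompt used for generation
    artifact_type: str    # e.g. "impossible physics", "temporal inconsistency"
    start_frame: int      # temporal extent of the artifact
    end_frame: int
    bbox: Optional[Tuple[int, int, int, int]] = None  # spatial extent (x, y, w, h), if localized
    annotator_id: str = ""                            # which human rater produced the label

# Example record: a physics violation spanning frames 12-40.
example = ArtifactAnnotation(
    clip_id="clip_00042",
    prompt="a glass of water tipping over on a wooden table",
    artifact_type="impossible physics",
    start_frame=12,
    end_frame=40,
    bbox=(128, 96, 64, 80),
    annotator_id="rater_07",
)
print(example)
```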
Related papers
- SemanticGen: Video Generation in Semantic Space [60.49729308406981]
State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. We introduce SemanticGen, a novel solution to generate videos in the semantic space. Our method is also effective and computationally efficient when extended to long video generation.
arXiv Detail & Related papers (2025-12-23T18:59:56Z)
- SAGA: Source Attribution of Generative AI Videos [23.217701516122048]
We introduce SAGA (Source Attribution of Generative AI videos), the first comprehensive framework to address the need for AI-generated video source attribution at scale. It provides multi-granular attribution across five levels: authenticity, generation task (e.g., T2V/I2V), model version, development team, and the precise generator, offering far richer forensic insights (a hypothetical sketch of this five-level label appears after the related-papers list).
arXiv Detail & Related papers (2025-11-16T23:39:54Z)
- BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos [63.03271511550633]
BrokenVideos is a benchmark dataset of 3,254 AI-generated videos with meticulously annotated, pixel-level masks highlighting regions of visual corruption. Our experiments show that training state-of-the-art artifact detection models and multimodal large language models (MLLMs) on BrokenVideos significantly improves their ability to localize corrupted regions (a mask-overlap scoring sketch appears after the related-papers list).
arXiv Detail & Related papers (2025-06-25T03:30:04Z)
- A Survey: Spatiotemporal Consistency in Video Generation [72.82267240482874]
Video generation schemes, by leveraging dynamic visual generation methods, push the boundaries of Artificial Intelligence Generated Content (AIGC). Recent works have aimed at addressing the spatiotemporal consistency issue in video generation, yet few literature reviews have been organized from this perspective. We systematically review recent advances in video generation, covering five key aspects: foundation models, information representations, generation schemes, post-processing techniques, and evaluation metrics.
arXiv Detail & Related papers (2025-02-25T05:20:51Z)
- OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation [27.516068877910254]
We introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs.
arXiv Detail & Related papers (2024-11-28T07:01:06Z)
- Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [49.298187741014345]
Current methods intertwine spatial content and temporal dynamics, increasing the complexity of text-to-video (T2V) generation.
We propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives.
arXiv Detail & Related papers (2023-12-07T17:59:07Z)
- Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions.
We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z)
- Video Generation from Text Employing Latent Path Construction for Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in Machine Learning and Computer Vision.
In this paper, we tackle the text-to-video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z)
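For the SAGA entry above: the abstract names five attribution levels, and a minimal sketch of how such a multi-granular label might be represented follows. The five levels are taken from the abstract; the record layout and all field names are hypothetical.

```python
from dataclasses import dataclass

# The five attribution levels come straight from the SAGA abstract; the
# record layout and field names are a hypothetical illustration.
@dataclass
class SourceAttribution:
    is_generated: bool      # level 1: authenticity (real vs. AI-generated)
    generation_task: str    # level 2: e.g. "T2V" or "I2V"
    model_version: str      # level 3: which release of the model
    development_team: str   # level 4: who built it
    generator: str          # level 5: the precise generator

# Example: a label attributing a clip to a (hypothetical) generator.
label = SourceAttribution(
    is_generated=True,
    generation_task="T2V",
    model_version="v2.1",
    development_team="example-lab",
    generator="example-lab/t2v-model-v2.1",
)
print(label)
```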
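For the BrokenVideos entry above: with pixel-level corruption masks, localization quality is naturally scored by mask overlap. Below is a minimal NumPy sketch of per-frame intersection-over-union between a predicted and an annotated mask; the benchmark's actual metric is not stated in the abstract, so IoU here is an assumption.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union else 1.0  # two empty masks agree

# Toy example: two 4x4 masks that overlap on two pixels.
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:3] = True   # 4 predicted pixels
gt = np.zeros((4, 4), dtype=bool)
gt[2:4, 1:3] = True     # 4 annotated pixels
print(mask_iou(pred, gt))  # 2 / 6 ≈ 0.333
```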