ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models
- URL: http://arxiv.org/abs/2411.10867v1
- Date: Sat, 16 Nov 2024 19:23:12 GMT
- Title: ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models
- Authors: Vipula Rawte, Sarthak Jain, Aarush Sinha, Garv Kaushik, Aman Bansal, Prathiksha Rumale Vishwanath, Samyak Rajesh Jain, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha, Amit P. Sheth, Amitava Das
- Abstract summary: We introduce ViBe: a large-scale Text-to-Video Benchmark of hallucinated videos from T2V models.
Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos.
This benchmark aims to drive the development of robust T2V models that produce videos more accurately aligned with input prompts.
- Score: 13.04745908368858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Latest developments in Large Multimodal Models (LMMs) have broadened their capabilities to include video understanding. Specifically, Text-to-video (T2V) models have made significant progress in quality, comprehension, and duration, excelling at creating videos from simple textual prompts. Yet, they still frequently produce hallucinated content that clearly signals the video is AI-generated. We introduce ViBe: a large-scale Text-to-Video Benchmark of hallucinated videos from T2V models. We identify five major types of hallucination: Vanishing Subject, Numeric Variability, Temporal Dysmorphia, Omission Error, and Physical Incongruity. Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos, comprising 3,782 videos annotated by humans into these five categories. ViBe offers a unique resource for evaluating the reliability of T2V models and provides a foundation for improving hallucination detection and mitigation in video generation. We establish classification as a baseline and present various ensemble classifier configurations, with the TimeSFormer + CNN combination yielding the best performance, achieving 0.345 accuracy and 0.342 F1 score. This benchmark aims to drive the development of robust T2V models that produce videos more accurately aligned with input prompts.
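To make the classification baseline concrete, the following minimal PyTorch sketch (not the authors' code) combines a per-frame CNN branch with a temporal-transformer branch standing in for TimeSFormer and classifies a clip into the five ViBe hallucination categories. The branch architectures, layer sizes, and logit-averaging fusion are assumptions for illustration only.

# Hedged sketch of a five-way video hallucination classifier in the spirit of
# ViBe's TimeSFormer + CNN ensemble baseline. All hyperparameters are assumed.
import torch
import torch.nn as nn

VIBE_CLASSES = [
    "Vanishing Subject",
    "Numeric Variability",
    "Temporal Dysmorphia",
    "Omission Error",
    "Physical Incongruity",
]

class FrameCNN(nn.Module):
    """Small per-frame CNN branch; averages frame-level logits over time."""
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, frames, 3, H, W) -> per-frame logits, then mean over frames
        b, t, c, h, w = video.shape
        feats = self.backbone(video.reshape(b * t, c, h, w))
        return self.head(feats).reshape(b, t, -1).mean(dim=1)

class TemporalTransformer(nn.Module):
    """Transformer over per-frame embeddings, a stand-in for TimeSFormer."""
    def __init__(self, num_classes: int = 5, dim: int = 64):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # Embed each frame, run self-attention over the frame sequence, pool, classify.
        b, t, c, h, w = video.shape
        tokens = self.embed(video.reshape(b * t, c, h, w)).reshape(b, t, -1)
        return self.head(self.temporal(tokens).mean(dim=1))

class EnsembleClassifier(nn.Module):
    """Averages the two branches' logits as a simple ensemble rule (assumed)."""
    def __init__(self):
        super().__init__()
        self.cnn = FrameCNN()
        self.transformer = TemporalTransformer()

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        return 0.5 * (self.cnn(video) + self.transformer(video))

if __name__ == "__main__":
    clip = torch.randn(2, 16, 3, 112, 112)   # 2 videos, 16 frames each
    logits = EnsembleClassifier()(clip)      # shape (2, 5)
    print([VIBE_CLASSES[i] for i in logits.argmax(dim=1).tolist()])

Running this on random tensors only checks shapes; the actual baseline would be trained on the 3,782 annotated ViBe videos and evaluated with accuracy and F1.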
Related papers
- Vidi: Large Multimodal Models for Video Understanding and Editing [33.56852569192024]
We introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing scenarios.
The first release focuses on temporal retrieval, identifying the time ranges within the input videos corresponding to a given text query.
We also present the VUE-TR benchmark, which introduces five key advancements.
arXiv Detail & Related papers (2025-04-22T08:04:45Z) - HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation [99.6653979969241]
We introduce HOIGen-1M, the first large-scale dataset for HOI Generation, consisting of over one million high-quality videos.
To guarantee the high quality of videos, we first design an efficient framework to automatically curate HOI videos using powerful multimodal large language models (MLLMs).
To obtain accurate textual captions for HOI videos, we design a novel video description method based on a Mixture-of-Multimodal-Experts (MoME) strategy.
arXiv Detail & Related papers (2025-03-31T04:30:34Z) - xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
The DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
arXiv Detail & Related papers (2024-08-22T17:55:22Z) - VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs).
VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z) - Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of roughly 145 words, over 10x longer than in most video-text datasets.
Vriptor, a model trained on this corpus, is capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z) - Sora Detector: A Unified Hallucination Detection for Large Text-to-Video Models [24.33545993881271]
We introduce a novel unified framework designed to detect hallucinations across diverse large text-to-video (T2V) models.
Our framework is built upon a comprehensive analysis of hallucination phenomena, categorizing them based on their manifestation in the video content.
SoraDetector provides a robust and quantifiable measure of consistency as well as of static and dynamic hallucination.
arXiv Detail & Related papers (2024-05-07T10:39:14Z) - Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced by a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", built upon a fixed T2V model with specially designed components.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z) - Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA).
arXiv Detail & Related papers (2024-03-18T16:52:49Z) - Learning Video Representations from Large Language Models [31.11998135196614]
We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs).
We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators.
Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text.
arXiv Detail & Related papers (2022-12-08T18:59:59Z) - Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects (spatial and temporal resolution, faithfulness to text, and quality), Make-A-Video sets the new state of the art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z) - Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos with text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)