Sora Detector: A Unified Hallucination Detection for Large Text-to-Video Models
- URL: http://arxiv.org/abs/2405.04180v1
- Date: Tue, 7 May 2024 10:39:14 GMT
- Title: Sora Detector: A Unified Hallucination Detection for Large Text-to-Video Models
- Authors: Zhixuan Chu, Lei Zhang, Yichen Sun, Siqiao Xue, Zhibo Wang, Zhan Qin, Kui Ren
- Abstract summary: We introduce a novel unified framework designed to detect hallucinations across diverse large text-to-video (T2V) models.
Our framework is built upon a comprehensive analysis of hallucination phenomena, categorizing them based on their manifestation in the video content.
SoraDetector provides robust and quantifiable measures of consistency as well as of static and dynamic hallucination.
- Score: 24.33545993881271
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of text-to-video (T2V) generative models has enabled the synthesis of high-fidelity video content guided by textual descriptions. Despite this significant progress, these models are often susceptible to hallucination, generating content that contradicts the input text, which poses a challenge to their reliability and practical deployment. To address this critical issue, we introduce SoraDetector, a novel unified framework designed to detect hallucinations across diverse large T2V models, including the cutting-edge Sora model. Our framework is built upon a comprehensive analysis of hallucination phenomena, categorizing them by how they manifest in the video content. Leveraging state-of-the-art keyframe extraction techniques and multimodal large language models, SoraDetector first evaluates the consistency between the extracted video content summary and the textual prompt, then constructs static and dynamic knowledge graphs (KGs) from frames to detect hallucinations both within single frames and across frames. SoraDetector thus provides robust and quantifiable measures of consistency and of static and dynamic hallucination. In addition, we have developed the Sora Detector Agent to automate the hallucination detection process and generate a complete video quality report for each input video. Lastly, we present a novel meta-evaluation benchmark, T2VHaluBench, meticulously crafted to facilitate the evaluation of advancements in T2V hallucination detection. Through extensive experiments on videos generated by Sora and other large T2V models, we demonstrate the efficacy of our approach in accurately detecting hallucinations. The code and dataset can be accessed via GitHub.
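The abstract outlines a concrete pipeline: keyframe extraction, multimodal LLM summarization, a prompt-consistency score, and KG-based static/dynamic hallucination checks. The Python sketch below is a minimal illustration of how such scores might be composed; every helper callable, the `contradiction_rate` proxy, and the report fields are assumptions made for this sketch, not the released SoraDetector implementation.

```python
# Hypothetical sketch only: all helper callables and the contradiction-rate
# scoring are illustrative assumptions, not the paper's actual implementation.
from dataclasses import dataclass
from statistics import mean
from typing import Any, Callable, List, Set, Tuple

Frame = Any                       # placeholder for a keyframe representation
Triple = Tuple[str, str, str]     # (subject, relation, object) KG triple


@dataclass
class VideoQualityReport:
    consistency: float             # prompt vs. video-summary agreement, in [0, 1]
    static_hallucination: float    # contradictions within individual frames
    dynamic_hallucination: float   # contradictions in cross-frame (temporal) relations


def contradiction_rate(candidate: Set[Triple], reference: Set[Triple]) -> float:
    """Fraction of candidate triples whose (subject, relation) appears in the
    reference KG but with a different object -- a simple proxy for hallucination."""
    if not candidate:
        return 0.0
    ref = {(s, r): o for s, r, o in reference}
    conflicts = sum(1 for s, r, o in candidate if (s, r) in ref and ref[(s, r)] != o)
    return conflicts / len(candidate)


def detect_hallucinations(
    keyframes: List[Frame],
    prompt: str,
    summarize: Callable[[List[Frame]], str],          # multimodal LLM video summary
    similarity: Callable[[str, str], float],          # text-text consistency score
    frame_kg: Callable[[Frame], Set[Triple]],         # static KG for one frame
    video_kg: Callable[[List[Frame]], Set[Triple]],   # dynamic KG across frames
    prompt_kg: Callable[[str], Set[Triple]],          # KG extracted from the prompt
) -> VideoQualityReport:
    summary = summarize(keyframes)
    consistency = similarity(summary, prompt)
    reference = prompt_kg(prompt)
    static_score = mean(contradiction_rate(frame_kg(f), reference) for f in keyframes)
    dynamic_score = contradiction_rate(video_kg(keyframes), reference)
    return VideoQualityReport(consistency, static_score, dynamic_score)
```

In practice the summarizer and KG builders would wrap a multimodal LLM and a relation extractor, and the simple contradiction proxy would be replaced by the paper's own static and dynamic hallucination measures.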
Related papers
- Simple Visual Artifact Detection in Sora-Generated Videos [9.991747596111011]
This study investigates visual artifacts frequently found and reported in Sora-generated videos.
We propose a multi-label classification framework targeting four common artifact label types.
The best-performing model, based on ResNet-50, achieved an average multi-label classification accuracy of 94.14%.
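As a rough illustration of that setup, a ResNet-50 multi-label artifact classifier can be sketched as follows; the artifact label names, the choice of weights, and the training details are assumptions for this sketch, not the paper's exact configuration.

```python
# Hypothetical sketch: ResNet-50 backbone with four independent sigmoid outputs,
# one per artifact type. Label names below are illustrative placeholders.
import torch
import torch.nn as nn
from torchvision import models

ARTIFACT_LABELS = ["deformed_object", "object_appearance_change",
                   "duplicated_object", "implausible_motion"]  # assumed names


class ArtifactClassifier(nn.Module):
    def __init__(self, num_labels: int = len(ARTIFACT_LABELS)):
        super().__init__()
        # weights=None keeps the sketch self-contained; ImageNet-pretrained
        # weights would typically be used in practice.
        self.backbone = models.resnet50(weights=None)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_labels)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 3, 224, 224) -> per-label logits (no softmax coupling)
        return self.backbone(frames)


model = ArtifactClassifier()
criterion = nn.BCEWithLogitsLoss()   # one independent binary decision per label

# One illustrative training step on a dummy batch.
x = torch.randn(8, 3, 224, 224)
y = torch.randint(0, 2, (8, len(ARTIFACT_LABELS))).float()
loss = criterion(model(x), y)
loss.backward()
```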
arXiv Detail & Related papers (2025-04-30T05:41:43Z)
- Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling [67.14942827452161]
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations.
In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification.
arXiv Detail & Related papers (2025-04-17T17:59:22Z)
- ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models [13.04745908368858]
We introduce ViBe: a large-scale Text-to-Video Benchmark of hallucinated videos from T2V models.
Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos.
This benchmark aims to drive the development of robust T2V models that produce videos more accurately aligned with input prompts.
arXiv Detail & Related papers (2024-11-16T19:23:12Z)
- Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
- What Matters in Detecting AI-Generated Videos like Sora? [51.05034165599385]
The gap between synthetic and real-world videos remains under-explored.
In this study, we compare real-world videos with those generated by a state-of-the-art AI model, Stable Video Diffusion.
Our model is capable of detecting videos generated by Sora with high accuracy, even without exposure to any Sora videos during training.
arXiv Detail & Related papers (2024-06-27T23:03:58Z)
- VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs).
VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z)
- WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs [53.21307319844615]
We present an innovative video generation AI agent that harnesses Sora-inspired multimodal learning to build a skilled world-model framework.
The framework includes two parts: prompt enhancer and full video translation.
arXiv Detail & Related papers (2024-03-10T16:09:02Z)
- OnUVS: Online Feature Decoupling Framework for High-Fidelity Ultrasound Video Synthesis [34.07625938756013]
Sonographers must observe corresponding dynamic anatomic structures to gather comprehensive information.
The synthesis of US videos may represent a promising solution to this issue.
We present a novel online feature-decoupling framework called OnUVS for high-fidelity US video synthesis.
arXiv Detail & Related papers (2023-08-16T10:16:50Z)
- Thinking Hallucination for Video Captioning [0.76146285961466]
In video captioning, there are two kinds of hallucination: object and action hallucination.
We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy.
Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets.
arXiv Detail & Related papers (2022-09-28T06:15:42Z)
- Weakly-Supervised Action Detection Guided by Audio Narration [50.4318060593995]
We propose a model to learn from the narration supervision and utilize multimodal features, including RGB, motion flow, and ambient sound.
Our experiments show that noisy audio narration suffices to learn a good action detection model, thus reducing annotation expenses.
arXiv Detail & Related papers (2022-05-12T06:33:24Z)
- Condensing a Sequence to One Informative Frame for Video Recognition [113.3056598548736]
This paper studies a two-step alternative that first condenses the video sequence to an informative "frame" and then performs recognition on that single frame.
A valid question is how to define "useful information" and then distill from a sequence down to one synthetic frame.
The proposed IFS approach consistently yields clear improvements on both image-based 2D networks and clip-based 3D networks.
arXiv Detail & Related papers (2022-01-11T16:13:43Z)