Simple Visual Artifact Detection in Sora-Generated Videos
- URL: http://arxiv.org/abs/2504.21334v1
- Date: Wed, 30 Apr 2025 05:41:43 GMT
- Title: Simple Visual Artifact Detection in Sora-Generated Videos
- Authors: Misora Sugiyama, Hirokatsu Kataoka
- Abstract summary: This study investigates visual artifacts frequently found and reported in Sora-generated videos. We propose a multi-label classification framework targeting four common artifact types. The best-performing model, based on ResNet-50, achieved an average multi-label classification accuracy of 94.14%.
- Score: 9.991747596111011
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The December 2024 release of OpenAI's Sora, a powerful video generation model driven by natural language prompts, highlights a growing convergence between large language models (LLMs) and video synthesis. As these multimodal systems evolve into video-enabled LLMs (VidLLMs), capable of interpreting, generating, and interacting with visual content, understanding their limitations and ensuring their safe deployment becomes essential. This study investigates visual artifacts frequently found and reported in Sora-generated videos, which can compromise quality, mislead viewers, or propagate disinformation. We propose a multi-label classification framework targeting four common artifact types: (1) boundary/edge defects, (2) texture/noise issues, (3) movement/joint anomalies, and (4) object mismatches/disappearances. Using a dataset of 300 manually annotated frames extracted from 15 Sora-generated videos, we trained multiple 2D backbone architectures (ResNet-50, EfficientNet-B3/B4, ViT-Base). The best-performing model, based on ResNet-50, achieved an average multi-label classification accuracy of 94.14%. This work supports the broader development of VidLLMs by contributing to (1) the creation of datasets for video quality evaluation, (2) interpretable artifact-based analysis beyond language metrics, and (3) the identification of visual risks relevant to factuality and safety.
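As a concrete illustration of the classification setup described in the abstract, the sketch below builds a frame-level multi-label classifier: a ResNet-50 backbone with a four-logit head (one sigmoid decision per artifact label), trained with binary cross-entropy and scored by averaging per-label accuracy. This is a minimal sketch under assumed defaults (hyperparameters, optimizer, stand-in tensors), not the authors' released code.

```python
# Minimal multi-label artifact classifier sketch (PyTorch).
# Assumption: frames carry a 4-element multi-hot label vector; all names
# and hyperparameters here are illustrative, not from the paper.
import torch
import torch.nn as nn
from torchvision import models

NUM_LABELS = 4  # boundary/edge, texture/noise, movement/joint, object mismatch


def build_model() -> nn.Module:
    """ResNet-50 backbone with a 4-logit head; one sigmoid per artifact label."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    model.fc = nn.Linear(model.fc.in_features, NUM_LABELS)
    return model


def train_step(model, frames, labels, optimizer, criterion):
    """One optimization step on a batch of frames with multi-hot labels."""
    optimizer.zero_grad()
    logits = model(frames)                  # shape (B, 4)
    loss = criterion(logits, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def multilabel_accuracy(model, frames, labels, threshold=0.5):
    """Average per-label accuracy: fraction of correct binary decisions."""
    probs = torch.sigmoid(model(frames))
    preds = (probs >= threshold).float()
    return (preds == labels.float()).float().mean().item()


if __name__ == "__main__":
    model = build_model()
    criterion = nn.BCEWithLogitsLoss()      # independent binary decision per label
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    frames = torch.randn(8, 3, 224, 224)    # stand-in for annotated Sora frames
    labels = torch.randint(0, 2, (8, NUM_LABELS))
    print(train_step(model, frames, labels, optimizer, criterion))
    print(multilabel_accuracy(model, frames, labels))
```

Swapping `models.resnet50` for an EfficientNet-B3/B4 or ViT-Base backbone only changes `build_model`; the multi-label head, loss, and accuracy averaging stay the same, which mirrors how the abstract compares architectures within one framework.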
Related papers
- Enhancing Multi-Modal Video Sentiment Classification Through Semi-Supervised Clustering [0.0]
We aim to improve video sentiment classification by focusing on the video itself, the accompanying text, and the acoustic features. We are developing a method that utilizes clustering-based semi-supervised pre-training to extract meaningful representations from the data.
arXiv Detail & Related papers (2025-01-11T08:04:39Z) - Distinguish Any Fake Videos: Unleashing the Power of Large-scale Data and Motion Features [21.583246378475856]
We introduce an extensive video dataset designed specifically for AI-Generated Video Detection (GenVidDet).
We also present the Dual-Branch 3D Transformer (DuB3D), an innovative and effective method for distinguishing between real and generated videos.
DuB3D distinguishes real from generated video content with 96.77% accuracy and generalizes well even to unseen types.
arXiv Detail & Related papers (2024-05-24T08:26:04Z) - Sora Detector: A Unified Hallucination Detection for Large Text-to-Video Models [24.33545993881271]
We introduce a novel unified framework designed to detect hallucinations across diverse large text-to-video (T2V) models.
Our framework is built upon a comprehensive analysis of hallucination phenomena, categorizing them based on their manifestation in the video content.
SoraDetector provides a robust and quantifiable measure of consistency as well as of static and dynamic hallucination.
arXiv Detail & Related papers (2024-05-07T10:39:14Z) - Detecting AI-Generated Video via Frame Consistency [25.290019967304616]
We propose, for the first time, an open-source dataset and a detection method for generated video. First, we propose a scalable dataset consisting of 964 prompts, covering various forgery targets, scenes, behaviors, and actions. Second, we find via probing experiments that spatial artifact-based detectors lack generalizability.
arXiv Detail & Related papers (2024-02-03T08:52:06Z) - LGDN: Language-Guided Denoising Network for Video-Language Modeling [30.99646752913056]
We propose an efficient and effective model, termed Language-Guided Denoising Network (LGDN), for video-language modeling.
Our LGDN dynamically filters out misaligned or redundant frames under language supervision and keeps only 2-4 salient frames per video for cross-modal token-level alignment (a minimal illustration is sketched after this list).
arXiv Detail & Related papers (2022-09-23T03:35:59Z) - Boosting Video Representation Learning with Multi-Faceted Integration [112.66127428372089]
Video content is multifaceted, consisting of objects, scenes, interactions or actions.
Existing datasets mostly label only one of the facets for model training, resulting in video representations that are biased toward a single facet depending on the training dataset.
We propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content.
arXiv Detail & Related papers (2022-01-11T16:14:23Z) - Condensing a Sequence to One Informative Frame for Video Recognition [113.3056598548736]
This paper studies a two-step alternative that first condenses the video sequence to an informative "frame" and then performs recognition on that synthetic frame.
A valid question is how to define "useful information" and then distill from a sequence down to one synthetic frame.
IFS consistently yields clear improvements for image-based 2D networks and clip-based 3D networks.
arXiv Detail & Related papers (2022-01-11T16:13:43Z) - Few-Shot Video Object Detection [70.43402912344327]
We introduce Few-Shot Video Object Detection (FSVOD) with three important contributions.
FSVOD-500 comprises 500 classes with class-balanced videos in each category for few-shot learning.
Our TPN and TMN+ are trained jointly and end-to-end.
arXiv Detail & Related papers (2021-04-30T07:38:04Z) - Comprehensive Information Integration Modeling Framework for Video Titling [124.11296128308396]
We integrate comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end modeling framework.
The proposed method consists of two processes: granular-level interaction modeling and abstraction-level story-line summarization.
We collect a large-scale dataset accordingly from real-world data in Taobao, a world-leading e-commerce platform.
arXiv Detail & Related papers (2020-06-24T10:38:15Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass on the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z) - VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with subtitles aligned as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
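For the LGDN entry above (referenced as the sketch after this list), the following is a hypothetical illustration of language-guided salient-frame selection: each frame embedding is scored against the text embedding and only the top-k frames are kept for cross-modal alignment. The encoders, embedding size, and k = 4 are assumptions for illustration, not details taken from the LGDN paper.

```python
# Illustrative language-guided salient-frame selection: score frames against
# the text and keep the top-k. Encoders and k are placeholders, not LGDN code.
import torch
import torch.nn.functional as F


def select_salient_frames(frame_embs: torch.Tensor,
                          text_emb: torch.Tensor,
                          k: int = 4) -> torch.Tensor:
    """frame_embs: (num_frames, dim); text_emb: (dim,).
    Returns indices of the k frames most aligned with the text."""
    frame_embs = F.normalize(frame_embs, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    scores = frame_embs @ text_emb          # cosine similarity per frame
    return torch.topk(scores, k=min(k, frame_embs.size(0))).indices


if __name__ == "__main__":
    frames = torch.randn(32, 512)           # stand-in frame embeddings
    text = torch.randn(512)                 # stand-in caption embedding
    print(select_salient_frames(frames, text, k=4))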