VGMShield: Mitigating Misuse of Video Generative Models
- URL: http://arxiv.org/abs/2402.13126v2
- Date: Fri, 04 Jul 2025 04:21:23 GMT
- Title: VGMShield: Mitigating Misuse of Video Generative Models
- Authors: Yan Pang, Baicheng Chen, Yang Zhang, Tianhao Wang,
- Abstract summary: VGMShield is a set of straightforward but effective mitigations through the lifecycle of fake video generation. We start from fake video detection, trying to understand whether there is uniqueness in generated videos. Then, we investigate the fake video source tracing problem, which maps a fake video back to the model that generated it.
- Score: 7.1819804607793705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid advancement in video generation, people can conveniently use video generation models to create videos tailored to their specific desires. As a result, there are also growing concerns about the potential misuse of video generation for spreading illegal content and misinformation. In this work, we introduce VGMShield: a set of straightforward but effective mitigations through the lifecycle of fake video generation. We start from fake video detection, trying to understand whether there is uniqueness in generated videos and whether we can differentiate them from real videos; then, we investigate the fake video source tracing problem, which maps a fake video back to the model that generated it. Towards these, we propose to leverage pre-trained models that focus on spatial-temporal dynamics as the backbone to identify inconsistencies in videos. In detail, we analyze fake videos from the perspective of the generation process. Based on the observation of attention shifts, motion variations, and frequency fluctuations, we identify common patterns in the generated video. These patterns serve as the foundation for our experiments on fake video detection and source tracing. Through experiments on seven state-of-the-art open-source models, we demonstrate that current models still cannot reliably reproduce spatial-temporal relationships, and thus, we can accomplish detection and source tracing with over 90% accuracy. Furthermore, anticipating future generative model improvements, we propose a prevention method that adds invisible perturbations to the query images to make the generated videos look unreal. Together with detection and tracing, our multi-faceted set of solutions can effectively mitigate misuse of video generative models.
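The abstract frames both detection and source tracing as classification on top of a pre-trained spatial-temporal backbone. As an illustration only, not the authors' released code, the sketch below shows one way such a classifier could be set up; the backbone choice (torchvision's r3d_18), the eight-way label space (one "real" class plus seven candidate generators), and all shapes are assumptions made for this example.
```python
# Illustrative sketch (not the paper's implementation): fake-video detection and
# source tracing treated as one multi-class problem over features from a
# pre-trained spatial-temporal backbone. Backbone, label space, and shapes are
# assumptions made for this example.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

NUM_CLASSES = 8  # assumed: 1 "real" class + 7 candidate generative models

class VideoForensicsClassifier(nn.Module):
    def __init__(self, num_classes: int = NUM_CLASSES):
        super().__init__()
        self.backbone = r3d_18(weights="DEFAULT")  # Kinetics-pretrained 3D ResNet
        self.backbone.fc = nn.Identity()           # keep the 512-d clip features
        self.head = nn.Linear(512, num_classes)    # detection / source-tracing head

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, 3, frames, height, width), normalized as the backbone expects
        return self.head(self.backbone(clips))

if __name__ == "__main__":
    model = VideoForensicsClassifier()
    dummy = torch.randn(2, 3, 16, 112, 112)  # two 16-frame clips
    print(model(dummy).shape)                # torch.Size([2, 8])
```
Binary real-vs-fake detection is the special case num_classes=2; widening the same head to one class per known generator gives the source-tracing variant described in the abstract.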
Related papers
- BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos [63.03271511550633]
BrokenVideos is a benchmark dataset of 3,254 AI-generated videos with meticulously annotated, pixel-level masks highlighting regions of visual corruption. Our experiments show that training state-of-the-art artifact detection models and multimodal large language models (MLLMs) on BrokenVideos significantly improves their ability to localize corrupted regions.
arXiv Detail & Related papers (2025-06-25T03:30:04Z)
- Direct Motion Models for Assessing Generated Videos [38.04485796547767]
A current limitation of video generative models is that they generate plausible-looking frames but poor motion.
Here we go beyond FVD by developing a metric which better measures plausible object interactions and motion.
We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric which is markedly more sensitive to temporal distortions in synthetic data.
arXiv Detail & Related papers (2025-04-30T22:34:52Z)
- GenVidBench: A Challenging Benchmark for Detecting AI-Generated Video [35.05198100139731]
We introduce GenVidBench, a challenging AI-generated video detection dataset with several key advantages. The dataset includes videos from 8 state-of-the-art AI video generators. The videos are analyzed from multiple dimensions and classified into various semantic categories based on their content.
arXiv Detail & Related papers (2025-01-20T08:58:56Z)
- Generative Video Propagation [87.15843701018099]
Our framework, GenProp, encodes the original video with a selective content encoder and propagates the changes made to the first frame using an image-to-video generation model.
Experimental results demonstrate the leading performance of our model in various video tasks.
arXiv Detail & Related papers (2024-12-27T17:42:29Z)
- Can Generative Video Models Help Pose Estimation? [42.10672365565019]
Pairwise pose estimation from images with little or no overlap is an open challenge in computer vision.
Inspired by the human ability to infer spatial relationships from diverse scenes, we propose a novel approach, InterPose.
We propose a video model to hallucinate intermediate frames between two input images, effectively creating a dense, visual transition.
arXiv Detail & Related papers (2024-12-20T18:58:24Z)
- Video Creation by Demonstration [59.389591010842636]
We present $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process. Empirically, $\delta$-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations.
arXiv Detail & Related papers (2024-12-12T18:41:20Z)
- WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z)
- SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix [60.48666051245761]
We propose a pose-free and training-free approach for generating 3D stereoscopic videos.
Our method warps a generated monocular video into camera views on a stereoscopic baseline using estimated video depth.
We develop a disocclusion boundary re-injection scheme that further improves the quality of video inpainting.
arXiv Detail & Related papers (2024-06-29T08:33:55Z)
- What Matters in Detecting AI-Generated Videos like Sora? [51.05034165599385]
The gap between synthetic and real-world videos remains under-explored.
In this study, we compare real-world videos with those generated by a state-of-the-art AI model, Stable Video Diffusion.
Our model is capable of detecting videos generated by Sora with high accuracy, even without exposure to any Sora videos during training.
arXiv Detail & Related papers (2024-06-27T23:03:58Z)
- Turns Out I'm Not Real: Towards Robust Detection of AI-Generated Videos [16.34393937800271]
Generative models capable of creating high-quality videos have raised concerns about digital integrity and privacy vulnerabilities.
Recent works to combat Deepfake videos have developed detectors that are highly accurate at identifying GAN-generated samples.
We propose a novel framework for detecting videos synthesized from multiple state-of-the-art (SOTA) generative models.
arXiv Detail & Related papers (2024-06-13T21:52:49Z)
- 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models [53.89348957053395]
We introduce a novel pipeline designed for text-to-4D scene generation.
Our method begins by generating a reference video using the video generation model.
We then learn the canonical 3D representation of the video using a freeze-time video.
arXiv Detail & Related papers (2024-06-11T17:19:26Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- DeCoF: Generated Video Detection via Frame Consistency: The First Benchmark Dataset [32.236653072212015]
We propose, for the first time, an open-source dataset and a detection method for generated videos.
First, we propose a scalable dataset consisting of 964 prompts, covering various forgery targets, scenes, behaviors, and actions.
Second, we found via probing experiments that spatial artifact-based detectors lack generalizability.
arXiv Detail & Related papers (2024-02-03T08:52:06Z)
- Video Demoireing with Relation-Based Temporal Consistency [68.20281109859998]
Moiré patterns, appearing as color distortions, severely degrade image and video quality when filming a screen with digital cameras.
We study how to remove such undesirable moire patterns in videos, namely video demoireing.
arXiv Detail & Related papers (2022-04-06T17:45:38Z)
- Video Content Swapping Using GAN [1.2300363114433952]
In this work, we break down any frame in the video into content and pose.
We first extract the pose information from a video using a pre-trained human pose detector and then use a generative model to synthesize the video based on the content code and pose code.
arXiv Detail & Related papers (2021-11-21T23:01:58Z)
- Non-Adversarial Video Synthesis with Learned Priors [53.26777815740381]
We focus on the problem of generating videos from latent noise vectors, without any reference input frames.
We develop a novel approach that jointly optimizes the input latent space, the weights of a recurrent neural network and a generator through non-adversarial learning.
Our approach generates superior quality videos compared to the existing state-of-the-art methods.
arXiv Detail & Related papers (2020-03-21T02:57:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.