VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning
- URL: http://arxiv.org/abs/2602.08828v1
- Date: Mon, 09 Feb 2026 16:00:01 GMT
- Title: VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning
- Authors: Hao Tan, Jun Lan, Senyuan Shi, Zichang Tan, Zijian Yu, Huijia Zhu, Weiqiang Wang, Jun Wan, Zhen Lei
- Abstract summary: VideoVeritas is a framework for fine-grained perception and fact-based reasoning. It is trained with Joint Preference Alignment and Perception Pretext Reinforcement Learning (PPRL).
- Score: 42.22791607763693
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing capability of video generation poses escalating security risks, making reliable detection increasingly essential. In this paper, we introduce VideoVeritas, a framework that integrates fine-grained perception and fact-based reasoning. We observe that while current multi-modal large language models (MLLMs) exhibit strong reasoning capacity, their granular perception ability remains limited. To mitigate this, we introduce Joint Preference Alignment and Perception Pretext Reinforcement Learning (PPRL). Specifically, rather than directly optimizing for the detection task, we adopt general spatiotemporal grounding and self-supervised object counting in the RL stage, enhancing detection performance through simple perception pretext tasks. To facilitate robust evaluation, we further introduce MintVid, a lightweight yet high-quality dataset containing 3K videos from 9 state-of-the-art generators, along with a real-world collected subset whose content contains factual errors. Experimental results demonstrate that existing methods tend to be biased toward either superficial reasoning or mechanical analysis, whereas VideoVeritas achieves more balanced performance across diverse benchmarks.
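The abstract's key move is rewarding pretext perception tasks (spatiotemporal grounding, self-supervised object counting) during RL instead of the detection label itself. As a rough illustration only, the Python sketch below shows one way such a pretext reward could be scored; the reward terms, weights, and function names are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of a perception-pretext reward:
# instead of rewarding the real/fake label directly, the RL stage scores how
# well a rollout grounds an event in time and counts objects.

def temporal_iou(pred: tuple, gt: tuple) -> float:
    """IoU between predicted and ground-truth time spans (seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def counting_reward(pred_count: int, true_count: int) -> float:
    """Soft reward that decays with relative counting error."""
    if true_count == 0:
        return 1.0 if pred_count == 0 else 0.0
    return max(0.0, 1.0 - abs(pred_count - true_count) / true_count)

def pretext_reward(pred_span, gt_span, pred_count, true_count,
                   w_ground=0.5, w_count=0.5):
    """Combined pretext signal used in place of a direct detection reward."""
    return (w_ground * temporal_iou(pred_span, gt_span)
            + w_count * counting_reward(pred_count, true_count))

# A rollout grounds a suspicious motion to 2.0-3.5s (GT 1.8-3.4s) and
# counts 4 objects where 5 are visible -> reward ~0.81.
print(pretext_reward((2.0, 3.5), (1.8, 3.4), pred_count=4, true_count=5))
```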
Related papers
- Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning [28.87800134659646]
Video-o3 is a novel framework that supports iterative discovery of salient visual clues. Video-o3 substantially outperforms state-of-the-art methods, achieving 72.1% accuracy on MLVU and 46.5% on Video-Holmes.
arXiv Detail & Related papers (2026-01-30T17:47:30Z)
- EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning [58.42596067220998]
Deepfake video technology has not only facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection methods face issues such as a lack of transparency in their principles and insufficient capability to cope with forgery techniques. This paper proposes the explainable deepfake video detection (EDVD) task and designs the EDVD-LLaMA multimodal reasoning framework.
arXiv Detail & Related papers (2025-10-18T10:34:05Z)
- Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models [78.32948112203228]
Video understanding represents the most challenging frontier in computer vision. The recent emergence of Video Large Multimodal Models (Video-LMMs) has demonstrated remarkable capabilities in video understanding tasks. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities.
arXiv Detail & Related papers (2025-10-06T17:10:44Z)
- Training-free Uncertainty Guidance for Complex Visual Tasks with MLLMs [61.64185573373394]
We propose a training-free framework that uses an MLLM's intrinsic uncertainty as a proactive guidance signal. We introduce a unified mechanism that scores candidate visual inputs by response uncertainty, enabling the model to autonomously focus on the most salient data. Our work validates that harnessing intrinsic uncertainty is a powerful, general strategy for enhancing fine-grained multimodal performance.
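The mechanism this abstract describes, scoring candidate visual inputs by the model's own response uncertainty, can be illustrated in a few lines. The `answer_distribution` stub below stands in for a real MLLM call and its values are invented; only the entropy-based selection rule reflects the abstract.

```python
import math

def entropy(probs):
    """Shannon entropy of an answer distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def answer_distribution(candidate, question):
    # Stand-in for a real MLLM call: softmax over answer-option logits for
    # this (candidate input, question) pair. Values here are invented.
    fake = {
        "frame_012": [0.90, 0.05, 0.05],  # model is confident
        "frame_047": [0.40, 0.35, 0.25],  # model is uncertain
        "frame_101": [0.55, 0.30, 0.15],
    }
    return fake[candidate]

def select_by_uncertainty(candidates, question):
    """Keep the candidate input whose response entropy is lowest."""
    return min(candidates, key=lambda c: entropy(answer_distribution(c, question)))

print(select_by_uncertainty(["frame_012", "frame_047", "frame_101"],
                            "Which frame best answers the query?"))  # frame_012
```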
arXiv Detail & Related papers (2025-10-01T09:20:51Z)
- Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency [56.475612147721264]
We propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals. We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA. Results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs.
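As a hedged sketch of the dual-reward idea, the snippet below combines a discrete semantic reward (exact answer match) with a continuous temporal reward (span IoU); the matching rule and the 0.5/0.5 weights are assumptions, not the paper's settings.

```python
def temporal_iou(pred, gt):
    """Continuous reward: IoU of predicted vs. ground-truth time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def dual_reward(pred_answer, gt_answer, pred_span, gt_span,
                w_sem=0.5, w_tmp=0.5):
    """Discrete semantic term plus continuous temporal term."""
    r_semantic = 1.0 if pred_answer.strip().lower() == gt_answer.strip().lower() else 0.0
    return w_sem * r_semantic + w_tmp * temporal_iou(pred_span, gt_span)

# Correct answer, partially overlapping grounding -> 0.5*1.0 + 0.5*(4/6) ~ 0.83
print(dual_reward("a dog", "A dog", (4.0, 9.0), (5.0, 10.0)))
```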
arXiv Detail & Related papers (2025-06-02T17:28:26Z)
- ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding [71.654781631463]
ReAgent-V is a novel agentic video understanding framework. It integrates efficient frame selection with real-time reward generation during inference. Extensive experiments on 12 datasets demonstrate significant gains in generalization and reasoning.
arXiv Detail & Related papers (2025-06-02T04:23:21Z)
- LAVID: An Agentic LVLM Framework for Diffusion-Generated Video Detection [14.687867348598035]
Large Vision Language Models (LVLMs) have become an emerging tool for AI-generated content detection. We propose LAVID, a novel LVLM-based AI-generated video detection framework with explicit knowledge enhancement. Our proposed pipeline automatically selects a set of explicit knowledge tools for detection, and then adaptively adjusts the structured prompt by self-rewriting.
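To make the described pipeline concrete, here is a rough, assumption-laden sketch of an agentic detection loop: explicit-knowledge tools are selected, their outputs are folded into a structured prompt, and the prompt is adjusted between rounds. The tool names, the `call_lvlm` stub, and the rewriting heuristic are all hypothetical, not LAVID's actual components.

```python
# Hypothetical sketch of an agentic LVLM detection loop in LAVID's spirit.

TOOLBOX = {
    "optical_flow":  lambda video: "irregular motion between frames 12-18",
    "frequency":     lambda video: "periodic artifacts in high-frequency bands",
    "face_landmark": lambda video: "no faces detected",
}

def call_lvlm(prompt: str) -> str:
    # Stand-in for a real LVLM API call; returns a canned reply here.
    return ("verdict: ai-generated | confidence: 0.82 | "
            "rewrite: drop face_landmark")

def detect(video: str, rounds: int = 2) -> str:
    tools = list(TOOLBOX)  # start with every explicit-knowledge tool selected
    reply = ""
    for _ in range(rounds):
        evidence = {name: TOOLBOX[name](video) for name in tools}
        prompt = ("Decide whether this video is real or AI-generated given:\n"
                  + "\n".join(f"- {name}: {out}" for name, out in evidence.items()))
        reply = call_lvlm(prompt)
        # Crude self-rewriting step: drop tools the model marks as unhelpful.
        tools = [t for t in tools if f"drop {t}" not in reply]
    return reply

print(detect("clip_0042.mp4"))
```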
arXiv Detail & Related papers (2025-02-20T19:34:58Z)
- Zero-Shot Action Recognition in Surveillance Videos [5.070026408553652]
Current AI-based video surveillance systems rely on core computer vision models that require extensive fine-tuning. VideoLLaMA2 represents a significant leap in zero-shot performance, with a 20% boost over the baseline. Self-ReS additionally increases zero-shot action recognition performance to 44.6%.
arXiv Detail & Related papers (2024-10-28T15:13:53Z)