Related papers: AvatarShield: Visual Reinforcement Learning for Human-Centric Synthetic Video Detection

URL: http://arxiv.org/abs/2505.15173v3
Date: Tue, 23 Sep 2025 14:29:40 GMT
Title: AvatarShield: Visual Reinforcement Learning for Human-Centric Synthetic Video Detection
Authors: Zhipei Xu, Xuanyu Zhang, Qing Huang, Xing Zhou, Jian Zhang
Abstract summary: Human-centric video generation methods can synthesize entire human bodies with controllable movements. Existing detection methods largely overlook the growing risks posed by such full-body synthetic content. We propose AvatarShield, a novel multimodal human-centric synthetic video detection framework.
Score: 20.800161778433914
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in Artificial Intelligence Generated Content have led to highly realistic synthetic videos, particularly in human-centric scenarios involving speech, gestures, and full-body motion, posing serious threats to information authenticity and public trust. Unlike DeepFake techniques that focus on localized facial manipulation, human-centric video generation methods can synthesize entire human bodies with controllable movements, enabling complex interactions with environments, objects, and even other people. However, existing detection methods largely overlook the growing risks posed by such full-body synthetic content. Meanwhile, a growing body of research has explored leveraging LLMs for interpretable fake detection, aiming to explain decisions in natural language. Yet these approaches heavily depend on supervised fine-tuning, which introduces limitations such as annotation bias, hallucinated supervision, and weakened generalization. To address these challenges, we propose AvatarShield, a novel multimodal human-centric synthetic video detection framework that eliminates the need for dense textual supervision by adopting Group Relative Policy Optimization, enabling LLMs to develop reasoning capabilities from simple binary labels. Our architecture combines a discrete vision tower for high-level semantic inconsistencies and a residual extractor for fine-grained artifact analysis. We further introduce FakeHumanVid, a large-scale benchmark containing 15K real and synthetic videos across nine state-of-the-art human generation methods driven by text, pose, or audio. Extensive experiments demonstrate that AvatarShield outperforms existing methods in both in-domain and cross-domain settings.

Related papers

EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents [85.77432303199176]
We propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes. Based on the collected data, we empower three embodied AI tasks: monocular human-scene-reconstruction, where we fine-tune feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we prove our data could be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via
arXiv Detail & Related papers (2026-02-26T16:53:41Z)
Generalizable Geometric Prior and Recurrent Spiking Feature Learning for Humanoid Robot Manipulation [90.90219129619344]
This paper presents a novel R-prior-S, Recurrent Geometric-priormodal Policy with Spiking features. To ground high-level reasoning in physical reality, we leverage lightweight 2D geometric inductive biases. For the data efficiency issue in robotic action generation, we introduce a Recursive Adaptive Spiking Network.
arXiv Detail & Related papers (2026-01-13T23:36:30Z)
Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning [66.51617619673587]
We present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy.
arXiv Detail & Related papers (2025-12-17T18:48:26Z)
From Generated Human Videos to Physically Plausible Robot Trajectories [103.28274349461607]
Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts. To realize this potential, how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. We propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints, and trained with symmetry regularization and keypoint-weighted tracking rewards.
arXiv Detail & Related papers (2025-12-04T18:56:03Z)
HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly [15.347208661111198]
HumanSAM aims to classify human-centric forgeries into three distinct types of artifacts commonly observed in generated content. HumanSAM yields promising results in comparison with state-of-the-art methods, both in binary and multi-class forgery classification.
arXiv Detail & Related papers (2025-07-26T12:03:47Z)
Leveraging Pre-Trained Visual Models for AI-Generated Video Detection [54.88903878778194]
The field of video generation has advanced beyond DeepFakes, creating an urgent need for methods capable of detecting AI-generated videos with generic content. We propose a novel approach that leverages pre-trained visual models to distinguish between real and generated videos. Our method achieves high detection accuracy, above 90% on average, underscoring its effectiveness.
arXiv Detail & Related papers (2025-07-17T15:36:39Z)
Robustness in AI-Generated Detection: Enhancing Resistance to Adversarial Attacks [4.179092469766417]
This paper investigates the vulnerabilities of current AI-generated face detection systems. We propose an approach that integrates adversarial training to mitigate the impact of adversarial examples. We also provide an in-depth analysis of adversarial and benign examples, offering insights into the intrinsic characteristics of AI-generated content.
arXiv Detail & Related papers (2025-05-06T11:19:01Z)
A Knowledge-guided Adversarial Defense for Resisting Malicious Visual Manipulation [93.28532038721816]
Malicious applications of visual manipulation have raised serious threats to the security and reputation of users in many fields. We propose a knowledge-guided adversarial defense (KGAD) to actively force malicious manipulation models to output semantically confusing samples.
arXiv Detail & Related papers (2025-04-11T10:18:13Z)
Unleashing the Power of Pre-trained Encoders for Universal Adversarial Attack Detection [21.03032944637112]
Adversarial attacks pose a critical security threat to real-world AI systems. This paper proposes a lightweight adversarial detection framework based on the large-scale pre-trained vision-language model CLIP.
arXiv Detail & Related papers (2025-04-01T05:21:45Z)
FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics [66.14786900470158]
We propose FakeScope, an expert large multimodal model (LMM) tailored for AI-generated image forensics. FakeScope identifies AI-synthetic images with high accuracy and provides rich, interpretable, and query-driven forensic insights. FakeScope achieves state-of-the-art performance in both closed-ended and open-ended forensic scenarios.
arXiv Detail & Related papers (2025-03-31T16:12:48Z)
HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data [55.739633494946204]
We present HumanVBench, an innovative benchmark meticulously crafted to bridge gaps in the evaluation of video MLLMs. HumanVBench comprises 16 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects. A comprehensive evaluation across 22 SOTA video MLLMs reveals notable limitations in current performance, especially in cross-modal and emotion perception.
arXiv Detail & Related papers (2024-12-23T13:45:56Z)
Human Action CLIPs: Detecting AI-generated Human Motion [13.106063755117399]
We describe an effective and robust technique for distinguishing real from AI-generated human motion using multi-modal semantic embeddings. This method is evaluated against DeepAction, a custom-built, open-sourced dataset of video clips with human actions generated by seven text-to-video AI models and matching real footage.
arXiv Detail & Related papers (2024-11-30T16:20:58Z)
Understanding and Improving Training-Free AI-Generated Image Detections with Vision Foundation Models [68.90917438865078]
Deepfake techniques for facial synthesis and editing pose serious risks for generative models. In this paper, we investigate how detection performance varies across model backbones, types, and datasets. We introduce Contrastive Blur, which enhances performance on facial images, and MINDER, which addresses noise type bias, balancing performance across domains.
arXiv Detail & Related papers (2024-11-28T13:04:45Z)
A Multimodal Framework for Deepfake Detection [0.0]
Deepfakes, synthetic media created using AI, can convincingly alter videos and audio to misrepresent reality. Our research addresses the critical issue of deepfakes through an innovative multimodal approach. Our framework combines visual and auditory analyses, yielding an accuracy of 94%.
arXiv Detail & Related papers (2024-10-04T14:59:10Z)
Adversarial Robustness of AI-Generated Image Detectors in the Real World [13.52355280061187]
We show that current state-of-the-art classifiers are vulnerable to adversarial examples under real-world conditions. Most attacks remain effective even when images are degraded during the upload to, e.g., social media platforms. In a case study, we demonstrate that these robustness challenges are also found in commercial tools by conducting black-box attacks on HIVE.
arXiv Detail & Related papers (2024-10-02T14:11:29Z)
UniForensics: Face Forgery Detection via General Facial Representation [60.5421627990707]
High-level semantic features are less susceptible to perturbations and not limited to forgery-specific artifacts, thus having stronger generalization. We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video network, with a meta-functional face classification for enriched facial representation.
arXiv Detail & Related papers (2024-07-26T20:51:54Z)
The Tug-of-War Between Deepfake Generation and Detection [4.62070292702111]
Multimodal generative models are rapidly evolving, leading to a surge in the generation of realistic video and audio. Deepfake videos, which can convincingly impersonate individuals, have particularly garnered attention due to their potential misuse. This survey paper examines the dual landscape of deepfake video generation and detection, emphasizing the need for effective countermeasures.
arXiv Detail & Related papers (2024-07-08T17:49:41Z)
HUNTER: Unsupervised Human-centric 3D Detection via Transferring Knowledge from Synthetic Instances to Real Scenes [21.2539366684941]
We propose an unsupervised 3D detection method for human-centric scenarios by transferring the knowledge from synthetic human instances to real scenes. Remarkably, our method exhibits superior performance compared to current state-of-the-art techniques.
arXiv Detail & Related papers (2024-03-05T08:37:05Z)
NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake Detection [50.33525966541906]
Existing multimodal detection methods capture audio-visual inconsistencies to expose Deepfake videos. We propose a novel Deepfake detection method to mine the correlation between Non-critical Phonemes and Visemes, termed NPVForensics. Our model can be easily adapted to the downstream Deepfake datasets with fine-tuning.
arXiv Detail & Related papers (2023-06-12T06:06:05Z)
Human View Synthesis using a Single Sparse RGB-D Input [16.764379184593256]
We present a novel view synthesis framework to generate realistic renders from unseen views of any human captured from a single-view sensor with sparse RGB-D. An enhancer network leverages the overall fidelity, even in occluded areas from the original view, producing crisp renders with fine details.
arXiv Detail & Related papers (2021-12-27T20:13:53Z)
VideoForensicsHQ: Detecting High-quality Manipulated Face Videos [77.60295082172098]
We show how the performance of forgery detectors depends on the presence of artefacts that the human eye can see. We introduce a new benchmark dataset for face video forgery detection, of unprecedented quality.
arXiv Detail & Related papers (2020-05-20T21:17:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.