HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly
- URL: http://arxiv.org/abs/2507.19924v2
- Date: Fri, 01 Aug 2025 12:25:21 GMT
- Title: HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly
- Authors: Chang Liu, Yunfan Ye, Fan Zhang, Qingyang Zhou, Yuchuan Luo, Zhiping Cai
- Abstract summary: HumanSAM aims to classify human-centric forgeries into three distinct types of artifacts commonly observed in generated content. HumanSAM yields promising results in comparison with state-of-the-art methods, both in binary and multi-class forgery classification.
- Score: 15.347208661111198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Numerous synthesized videos from generative models, especially human-centric ones that simulate realistic human actions, pose significant threats to human information security and authenticity. While progress has been made in binary forgery video detection, the lack of fine-grained understanding of forgery types raises concerns regarding both reliability and interpretability, which are critical for real-world applications. To address this limitation, we propose HumanSAM, a new framework that builds upon the fundamental challenges of video generation models. Specifically, HumanSAM aims to classify human-centric forgeries into three distinct types of artifacts commonly observed in generated content: spatial, appearance, and motion anomalies. To better capture the features of geometry, semantics, and spatiotemporal consistency, we generate the human forgery representation by fusing two branches: video understanding and spatial depth. We also adopt a rank-based confidence enhancement strategy during training to learn more robust representations by introducing three prior scores. For training and evaluation, we construct the first public benchmark, the Human-centric Forgery Video (HFV) dataset, with all types of forgeries carefully annotated semi-automatically. In our experiments, HumanSAM yields promising results in comparison with state-of-the-art methods, in both binary and multi-class forgery classification.
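The fusion and confidence-ranking ideas above can be made concrete with a short sketch. The following PyTorch code is a minimal illustration, not the authors' implementation: the encoder feature dimensions, the four-way label space (real plus the three anomaly types), the fusion layers, and the pairwise reading of the rank-based confidence strategy are all assumptions.

```python
import torch
import torch.nn as nn

class DualBranchForgeryClassifier(nn.Module):
    """Hypothetical HumanSAM-style classifier: fuse a video-understanding
    feature with a spatial-depth feature, then predict one of four labels
    (real / spatial / appearance / motion anomaly). All sizes are assumed."""

    def __init__(self, video_dim=768, depth_dim=256, hidden=512, num_classes=4):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden)  # video-understanding branch
        self.depth_proj = nn.Linear(depth_dim, hidden)  # spatial-depth branch
        self.fusion = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.GELU())
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, video_feat, depth_feat):
        fused = torch.cat([self.video_proj(video_feat),
                           self.depth_proj(depth_feat)], dim=-1)
        return self.head(self.fusion(fused))

def rank_confidence_loss(logits, prior_score, margin=0.1):
    """One plausible reading of 'rank-based confidence enhancement': push the
    model's confidence ordering over a batch to agree with a prior score
    (standing in for one of the three priors mentioned in the abstract)."""
    conf = logits.softmax(dim=-1).amax(dim=-1)                # (B,) confidences
    sign = (prior_score.unsqueeze(1) - prior_score.unsqueeze(0)).sign()
    diff = conf.unsqueeze(1) - conf.unsqueeze(0)              # pairwise gaps
    loss = torch.relu(margin - sign * diff)                   # ranking margin
    return loss[sign != 0].mean()                             # ranked pairs only

model = DualBranchForgeryClassifier()
logits = model(torch.randn(8, 768), torch.randn(8, 256))      # toy features
aux = rank_confidence_loss(logits, torch.rand(8))             # toy prior scores
```

In this reading, the ranking term is an auxiliary loss added to the usual cross-entropy, nudging predicted confidence to be consistent with the external priors; the strategy actually used in the paper may differ.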
Related papers
- Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning [50.76723760768117]
Existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. We find that human appearance can provide a straightforward cue to address these obstacles. We propose a dual-branch optimization framework to reconstruct accurate interactive motions with plausible body contacts constrained by human appearances, social proxemics, and physical laws.
arXiv Detail & Related papers (2025-07-03T12:19:26Z) - Multi-identity Human Image Animation with Structural Video Diffusion [64.20452431561436]
We present Structural Video Diffusion, a novel framework for generating realistic multi-human videos. Our approach introduces identity-specific embeddings to maintain consistent appearances across individuals. We expand an existing human video dataset with 25K new videos featuring diverse multi-human and object interaction scenarios.
arXiv Detail & Related papers (2025-04-05T10:03:49Z) - Video Anomaly Detection with Contours - A Study [24.525564527855092]
We investigate the potential of learning recurrent motion patterns of normal human behavior using 2D contours. Our results indicate that this novel perspective on Pose-based Video Anomaly Detection marks a promising direction for future research.
arXiv Detail & Related papers (2025-03-25T12:11:50Z) - HumanGif: Single-View Human Diffusion with Generative Prior [25.516544735593087]
We propose HumanGif, a single-view human diffusion model with generative priors. Specifically, we formulate single-view-based 3D human novel view and pose synthesis as a single-view-conditioned human diffusion process. We show that HumanGif achieves the best perceptual performance, with better generalizability for novel view and pose synthesis.
arXiv Detail & Related papers (2025-02-17T17:55:27Z) - HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data [55.739633494946204]
We present HumanVBench, an innovative benchmark meticulously crafted to bridge gaps in the evaluation of video MLLMs. HumanVBench comprises 16 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects. A comprehensive evaluation across 22 SOTA video MLLMs reveals notable limitations in current performance, especially in cross-modal and emotion perception.
arXiv Detail & Related papers (2024-12-23T13:45:56Z) - OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation [27.516068877910254]
We introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs.
arXiv Detail & Related papers (2024-11-28T07:01:06Z) - MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild [32.6521941706907]
We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos.
We first define a layered neural representation for the entire scene, composited from individual human and background models.
We learn the layered neural representation from videos via our layer-wise differentiable volume rendering.
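As a rough illustration of layer-wise differentiable volume rendering over such a layered representation, here is a generic front-to-back compositing routine in PyTorch. It is a sketch of the standard layered-NeRF-style technique, not MultiPly's code; the shapes, the density-merging rule, and the sample spacing are all assumptions.

```python
import torch

def composite_layers(rgb, density, deltas):
    """Render one ray through L scene layers (e.g., per-person + background).

    rgb:     (L, S, 3) per-layer color at S samples along the ray
    density: (L, S)    per-layer volume density at each sample
    deltas:  (S,)      spacing between consecutive samples
    Layers are merged by summing densities, then composited front to back,
    so gradients reach every layer, including occluded ones.
    """
    sigma = density.sum(dim=0)                          # merged density (S,)
    alpha = 1.0 - torch.exp(-sigma * deltas)            # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                             # compositing weights
    mix = density / sigma.clamp(min=1e-10)              # per-layer color share
    color = (mix.unsqueeze(-1) * rgb).sum(dim=0)        # blended color (S, 3)
    return (weights.unsqueeze(-1) * color).sum(dim=0)   # rendered pixel (3,)

pixel = composite_layers(torch.rand(2, 64, 3),          # two layers, 64 samples
                         torch.rand(2, 64),
                         torch.full((64,), 0.01))
```

Because every layer contributes density to the shared compositing weights, the renderer stays differentiable with respect to each person's model even under occlusion, which is the point of learning the layered representation directly from video.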
arXiv Detail & Related papers (2024-06-03T17:59:57Z) - Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption [64.07607726562841]
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration.
In this work, we tackle the task of reconstructing closely interactive humans from a monocular video.
We propose to leverage knowledge from proxemic behavior and physics to compensate for the lack of visual information.
arXiv Detail & Related papers (2024-04-17T11:55:45Z) - Learning Human Action Recognition Representations Without Real Humans [66.61527869763819]
We present a benchmark that leverages real-world videos with humans removed and synthetic data containing virtual humans to pre-train a model.
We then evaluate the transferability of the representation learned on this data to a diverse set of downstream action recognition benchmarks.
Our approach outperforms previous baselines by up to 5%.
arXiv Detail & Related papers (2023-11-10T18:38:14Z) - Human-Aware Motion Deblurring [197.53076361425363]
This paper proposes a human-aware deblurring model that disentangles the motion blur between foreground (FG) humans and background (BG).
The proposed model is based on a triple-branch encoder-decoder architecture.
The proposed model is further endowed with a supervised, human-aware attention mechanism in an end-to-end fashion.
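To make the FG/BG disentanglement concrete, here is a deliberately simplified PyTorch sketch: a shared encoder, a sigmoid attention map standing in for the supervised human-aware attention, and separate human/background decoders blended by that map. The paper describes a triple-branch encoder-decoder, so this two-decoder gating (and every layer size) is an assumption, not the published architecture.

```python
import torch
import torch.nn as nn

class HumanAwareDeblurSketch(nn.Module):
    """Toy FG/BG-disentangled deblurring model; all modules are invented."""

    def __init__(self, ch=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        # Attention head predicting a soft human mask; in a supervised
        # setup this map would be trained against human segmentation.
        self.attn = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())
        self.fg_dec = nn.Conv2d(ch, 3, 3, padding=1)  # human (FG) branch
        self.bg_dec = nn.Conv2d(ch, 3, 3, padding=1)  # background (BG) branch

    def forward(self, blurred):
        feat = self.encoder(blurred)
        mask = self.attn(feat)                         # (B, 1, H, W) in [0, 1]
        sharp = mask * self.fg_dec(feat) + (1 - mask) * self.bg_dec(feat)
        return sharp, mask                             # mask gets the attention loss

model = HumanAwareDeblurSketch()
sharp, mask = model(torch.randn(1, 3, 64, 64))         # toy blurred frame
```

Gating the two decoders with a mask that is itself supervised is one simple way an attention signal can specialize branches to human and background blur.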
arXiv Detail & Related papers (2020-01-19T12:16:39Z)