PiercingEye: Dual-Space Video Violence Detection with Hyperbolic Vision-Language Guidance
- URL: http://arxiv.org/abs/2504.18866v1
- Date: Sat, 26 Apr 2025 09:29:10 GMT
- Title: PiercingEye: Dual-Space Video Violence Detection with Hyperbolic Vision-Language Guidance
- Authors: Jiaxu Leng, Zhanjie Wu, Mingpi Tan, Mengjingcheng Mo, Jiankang Zheng, Qingqing Li, Ji Gan, Xinbo Gao
- Abstract summary: Existing weakly supervised video violence detection methods rely on Euclidean representation learning. We propose PiercingEye, a novel dual-space learning framework that synergizes Euclidean and hyperbolic geometries. Experiments on XD-Violence and UCF-Crime benchmarks demonstrate that PiercingEye achieves state-of-the-art performance.
- Score: 39.38656685766509
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing weakly supervised video violence detection (VVD) methods primarily rely on Euclidean representation learning, which often struggles to distinguish visually similar yet semantically distinct events due to limited hierarchical modeling and insufficient ambiguous training samples. To address this challenge, we propose PiercingEye, a novel dual-space learning framework that synergizes Euclidean and hyperbolic geometries to enhance discriminative feature representation. Specifically, PiercingEye introduces a layer-sensitive hyperbolic aggregation strategy with hyperbolic Dirichlet energy constraints to progressively model event hierarchies, and a cross-space attention mechanism to facilitate complementary feature interactions between Euclidean and hyperbolic spaces. Furthermore, to mitigate the scarcity of ambiguous samples, we leverage large language models to generate logic-guided ambiguous event descriptions, enabling explicit supervision through a hyperbolic vision-language contrastive loss that prioritizes high-confusion samples via dynamic similarity-aware weighting. Extensive experiments on XD-Violence and UCF-Crime benchmarks demonstrate that PiercingEye achieves state-of-the-art performance, with particularly strong results on a newly curated ambiguous event subset, validating its superior capability in fine-grained violence detection.
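The core hyperbolic operations behind the abstract's dual-space framing can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the paper's implementation: the function names, the curvature (fixed at -1), and the projection scheme are all hypothetical. It shows the two basic primitives a hyperbolic vision-language contrastive loss would build on: projecting Euclidean features into the Poincaré ball and measuring geodesic distance there.

```python
import numpy as np

def project_to_ball(x, eps=1e-5):
    """Rescale Euclidean vectors into the open unit (Poincare) ball."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    max_norm = 1.0 - eps
    scale = np.minimum(1.0, max_norm / np.maximum(norm, eps))
    return x * scale

def poincare_distance(u, v):
    """Geodesic distance in the Poincare ball with curvature -1."""
    sq = np.sum((u - v) ** 2, axis=-1)
    denom = (1.0 - np.sum(u ** 2, axis=-1)) * (1.0 - np.sum(v ** 2, axis=-1))
    return np.arccosh(1.0 + 2.0 * sq / denom)

# Toy vision/text embeddings: a matching pair sits close in the ball,
# a mismatched pair sits far apart; a contrastive loss would pull the
# former together and push the latter apart.
vis = project_to_ball(np.array([[0.3, 0.1], [0.5, 0.4]]))
txt = project_to_ball(np.array([[0.31, 0.12], [0.9, 0.8]]))
d = poincare_distance(vis, txt)
```

Because hyperbolic distance grows rapidly near the ball boundary, hierarchically distant events can be separated much more sharply than in Euclidean space, which is the property the dual-space design exploits.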
Related papers
- Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection [2.749898166276854]
Weakly supervised video anomaly detection (WS-VAD) has emerged as a contemporary research direction.
We propose a multi-modal WS-VAD framework to accurately detect anomalies such as violence and nudity.
We show that the proposed model achieves state-of-the-art results on benchmark datasets of violence and nudity detection.
arXiv Detail & Related papers (2024-12-29T12:46:57Z) - Towards Effective, Efficient and Unsupervised Social Event Detection in the Hyperbolic Space [54.936897625837474]
This work introduces an unsupervised framework, HyperSED (Hyperbolic SED). Specifically, the framework first models social messages into semantic-based message anchors, and then leverages the structure of the anchor graph. Experiments on public datasets demonstrate HyperSED's competitive performance, along with a substantial improvement in efficiency.
arXiv Detail & Related papers (2024-12-14T06:55:27Z) - Beyond Euclidean: Dual-Space Representation Learning for Weakly Supervised Video Violence Detection [41.37736889402566]
We develop a novel Dual-Space Representation Learning (DSRL) method for weakly supervised Video Violence Detection (VVD).
Our method captures the visual features of events while also exploring the intrinsic relations between events, thereby enhancing the discriminative capacity of the features.
arXiv Detail & Related papers (2024-09-28T05:54:20Z) - StealthDiffusion: Towards Evading Diffusion Forensic Detection through Diffusion Model [62.25424831998405]
StealthDiffusion is a framework that modifies AI-generated images into high-quality, imperceptible adversarial examples.
It is effective in both white-box and black-box settings, transforming AI-generated images into high-quality adversarial forgeries.
arXiv Detail & Related papers (2024-08-11T01:22:29Z) - UniForensics: Face Forgery Detection via General Facial Representation [60.5421627990707]
High-level semantic features are less susceptible to perturbations and not limited to forgery-specific artifacts, thus having stronger generalization.
We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video network, with a meta-functional face classification for enriched facial representation.
arXiv Detail & Related papers (2024-07-26T20:51:54Z) - Hyperbolic Face Anti-Spoofing [21.981129022417306]
We propose to learn richer hierarchical and discriminative spoofing cues in hyperbolic space.
For unimodal FAS learning, the feature embeddings are projected into the Poincaré ball, and then the hyperbolic binary logistic regression layer is cascaded for classification.
To alleviate the vanishing gradient problem in hyperbolic space, a new feature clipping method is proposed to enhance the training stability of hyperbolic models.
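The feature-clipping idea above can be sketched as follows. This is a minimal NumPy illustration under assumptions, not the paper's actual method: the clipping radius `r` is a hypothetical hyperparameter, and curvature is fixed at -1. The point is that bounding the Euclidean feature norm before the exponential map keeps embeddings away from the ball boundary, where hyperbolic gradients vanish.

```python
import numpy as np

def clip_features(x, r=1.0):
    """Clip the Euclidean norm of features to at most r (hypothetical
    radius) before mapping to the Poincare ball; this keeps points off
    the boundary and stabilizes training."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    scale = np.minimum(1.0, r / np.maximum(norm, 1e-12))
    return x * scale

def exp_map_origin(x):
    """Exponential map at the origin of the Poincare ball (curvature -1):
    exp_0(x) = tanh(||x||) * x / ||x||."""
    norm = np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12)
    return np.tanh(norm) * x / norm

# A feature with a large norm lands safely inside the ball:
# after clipping to r=1, its hyperbolic norm is at most tanh(1) < 1.
z = exp_map_origin(clip_features(np.array([[10.0, 0.0]]), r=1.0))
```

With `r=1.0`, every mapped point has norm at most tanh(1) ≈ 0.76, so the denominators in hyperbolic distance and gradient formulas stay bounded away from zero.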
arXiv Detail & Related papers (2023-08-17T17:18:21Z) - Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection(VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z) - Learning Weakly Supervised Audio-Visual Violence Detection in Hyperbolic Space [17.30264225835736]
HyperVD is a novel framework that learns snippet embeddings in hyperbolic space to improve model discrimination.
Our framework comprises a detour fusion module for multimodal fusion.
By learning snippet representations in this space, the framework effectively learns semantic discrepancies between violent and normal events.
arXiv Detail & Related papers (2023-05-30T07:18:56Z) - Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)