PiercingEye: Dual-Space Video Violence Detection with Hyperbolic Vision-Language Guidance
- URL: http://arxiv.org/abs/2504.18866v1
- Date: Sat, 26 Apr 2025 09:29:10 GMT
- Title: PiercingEye: Dual-Space Video Violence Detection with Hyperbolic Vision-Language Guidance
- Authors: Jiaxu Leng, Zhanjie Wu, Mingpi Tan, Mengjingcheng Mo, Jiankang Zheng, Qingqing Li, Ji Gan, Xinbo Gao
- Abstract summary: Existing weakly supervised video violence detection methods rely on Euclidean representation learning. We propose PiercingEye, a novel dual-space learning framework that synergizes Euclidean and hyperbolic geometries. Experiments on XD-Violence and UCF-Crime benchmarks demonstrate that PiercingEye achieves state-of-the-art performance.
- Score: 39.38656685766509
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing weakly supervised video violence detection (VVD) methods primarily rely on Euclidean representation learning, which often struggles to distinguish visually similar yet semantically distinct events due to limited hierarchical modeling and insufficient ambiguous training samples. To address this challenge, we propose PiercingEye, a novel dual-space learning framework that synergizes Euclidean and hyperbolic geometries to enhance discriminative feature representation. Specifically, PiercingEye introduces a layer-sensitive hyperbolic aggregation strategy with hyperbolic Dirichlet energy constraints to progressively model event hierarchies, and a cross-space attention mechanism to facilitate complementary feature interactions between Euclidean and hyperbolic spaces. Furthermore, to mitigate the scarcity of ambiguous samples, we leverage large language models to generate logic-guided ambiguous event descriptions, enabling explicit supervision through a hyperbolic vision-language contrastive loss that prioritizes high-confusion samples via dynamic similarity-aware weighting. Extensive experiments on XD-Violence and UCF-Crime benchmarks demonstrate that PiercingEye achieves state-of-the-art performance, with particularly strong results on a newly curated ambiguous event subset, validating its superior capability in fine-grained violence detection.
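For readers unfamiliar with the hyperbolic half of the dual-space design, the following is a minimal sketch of two standard Poincaré-ball operations: mapping a Euclidean feature vector into the ball and measuring geodesic distance there. The abstract does not say which hyperbolic model PiercingEye uses, so the Poincaré ball, the curvature parameter `c`, and the function names `expmap0` and `poincare_distance` are assumptions chosen for illustration, not the authors' implementation.

```python
import math


def expmap0(v, c=1.0):
    """Exponential map at the origin: send a Euclidean (tangent) vector
    into the Poincare ball of curvature -c. Illustrative helper only."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points strictly inside the ball."""
    sq_norm = lambda u: sum(a * a for a in u)
    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    # Clamp to 1.0 to guard against tiny negative rounding errors.
    arg = max(1.0, 1.0 + 2.0 * c * diff / denom)
    return math.acosh(arg) / math.sqrt(c)


if __name__ == "__main__":
    origin = [0.0, 0.0]
    p = expmap0([0.3, 0.4])          # tangent vector with Euclidean norm 0.5
    print(poincare_distance(origin, p))  # approximately 1.0, i.e. twice the tangent norm

    # The same Euclidean gap costs more hyperbolic distance near the boundary,
    # which is what leaves room for tree-like (hierarchical) structure.
    near_center = poincare_distance([0.0, 0.0], [0.1, 0.0])
    near_edge = poincare_distance([0.8, 0.0], [0.9, 0.0])
    print(near_center < near_edge)
```

Distances grow rapidly toward the boundary of the ball, which is the usual argument for why hyperbolic embeddings suit hierarchical relations better than Euclidean ones.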
Related papers
- CrystaL: Spontaneous Emergence of Visual Latents in MLLMs [55.34169914483764]
We propose CrystaL (Crystallized Latent Reasoning), a single-stage framework with two paths to process intact and corrupted images. By explicitly aligning the attention patterns and prediction distributions across the two paths, CrystaL crystallizes latent representations into task-relevant visual semantics. Experiments on perception-intensive benchmarks demonstrate that CrystaL consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-24T15:01:30Z) - Emotion Collider: Dual Hyperbolic Mirror Manifolds for Sentiment Recovery via Anti Emotion Reflection [19.83275015213163]
Emotion Collider (EC-Net) is a hyperbolic hypergraph framework for multimodal emotion and sentiment modeling. EC-Net represents hierarchies using Poincaré-ball embeddings and performs fusion through a hypergraph mechanism. Empirical results show that EC-Net produces robust, semantically coherent representations and consistently improves accuracy.
arXiv Detail & Related papers (2026-02-18T03:19:05Z) - Stable Language Guidance for Vision-Language-Action Models [62.80963701282789]
Residual Semantic Steering (RSS) is a probabilistic framework that disentangles physical affordance from semantic execution. RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
arXiv Detail & Related papers (2026-01-07T16:16:10Z) - When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models [75.16145284285456]
We introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. We develop the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark reveal that even minor multimodal perturbations can cause significant behavioral deviations.
arXiv Detail & Related papers (2025-11-20T10:14:32Z) - Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection [2.749898166276854]
Weakly supervised video anomaly detection (WS-VAD) has emerged as a contemporary research direction.
We propose a multi-modal WS-VAD framework to accurately detect anomalies such as violence and nudity.
We show that the proposed model achieves state-of-the-art results on benchmark datasets of violence and nudity detection.
arXiv Detail & Related papers (2024-12-29T12:46:57Z) - Towards Effective, Efficient and Unsupervised Social Event Detection in the Hyperbolic Space [54.936897625837474]
This work introduces an unsupervised framework, HyperSED (Hyperbolic SED). Specifically, the framework first models social messages into semantic-based message anchors, and then leverages the structure of the anchor graph. Experiments on public datasets demonstrate HyperSED's competitive performance, along with a substantial improvement in efficiency.
arXiv Detail & Related papers (2024-12-14T06:55:27Z) - Beyond Euclidean: Dual-Space Representation Learning for Weakly Supervised Video Violence Detection [41.37736889402566]
We develop a novel Dual-Space Representation Learning (DSRL) method for weakly supervised Video Violence Detection (VVD).
Our method captures the visual features of events while also exploring the intrinsic relations between events, thereby enhancing the discriminative capacity of the features.
arXiv Detail & Related papers (2024-09-28T05:54:20Z) - StealthDiffusion: Towards Evading Diffusion Forensic Detection through Diffusion Model [62.25424831998405]
StealthDiffusion is a framework that modifies AI-generated images into high-quality, imperceptible adversarial examples.
It is effective in both white-box and black-box settings, transforming AI-generated images into high-quality adversarial forgeries.
arXiv Detail & Related papers (2024-08-11T01:22:29Z) - UniForensics: Face Forgery Detection via General Facial Representation [60.5421627990707]
High-level semantic features are less susceptible to perturbations and not limited to forgery-specific artifacts, thus having stronger generalization.
We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video network, with a meta-functional face classification for enriched facial representation.
arXiv Detail & Related papers (2024-07-26T20:51:54Z) - Hyperbolic Face Anti-Spoofing [21.981129022417306]
We propose to learn richer hierarchical and discriminative spoofing cues in hyperbolic space.
For unimodal FAS learning, the feature embeddings are projected into the Poincaré ball, and then the hyperbolic binary logistic regression layer is cascaded for classification.
To alleviate the vanishing gradient problem in hyperbolic space, a new feature clipping method is proposed to enhance the training stability of hyperbolic models.
arXiv Detail & Related papers (2023-08-17T17:18:21Z) - Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z) - Learning Weakly Supervised Audio-Visual Violence Detection in Hyperbolic Space [17.30264225835736]
HyperVD is a novel framework that learns snippet embeddings in hyperbolic space to improve model discrimination.
Our framework comprises a detour fusion module for multimodal fusion.
By learning snippet representations in this space, the framework effectively learns semantic discrepancies between violent and normal events.
arXiv Detail & Related papers (2023-05-30T07:18:56Z) - Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.