Live-E2T: Real-time Threat Monitoring in Video via Deduplicated Event Reasoning and Chain-of-Thought
- URL: http://arxiv.org/abs/2509.18571v1
- Date: Tue, 23 Sep 2025 02:53:43 GMT
- Title: Live-E2T: Real-time Threat Monitoring in Video via Deduplicated Event Reasoning and Chain-of-Thought
- Authors: Yuhan Wang, Cheng Liu, Zihan Zhao, Weichao Wu
- Abstract summary: Live-E2T is a novel framework that unifies the requirements of real-time performance and decision explainability. We show that Live-E2T significantly outperforms state-of-the-art methods in terms of threat detection accuracy, real-time efficiency, and the crucial dimension of explainability.
- Score: 15.651072801329425
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-time threat monitoring identifies threatening behaviors in video streams and provides reasoning and assessment of threat events through explanatory text. However, prevailing methodologies, whether based on supervised learning or generative models, struggle to concurrently satisfy the demanding requirements of real-time performance and decision explainability. To bridge this gap, we introduce Live-E2T, a novel framework that unifies these two objectives through three synergistic mechanisms. First, we deconstruct video frames into structured Human-Object-Interaction-Place semantic tuples. This approach creates a compact, semantically focused representation, circumventing the information degradation common in conventional feature compression. Second, an efficient online event deduplication and updating mechanism is proposed to filter spatio-temporal redundancies, ensuring the system's real-time responsiveness. Finally, we fine-tune a Large Language Model using a Chain-of-Thought strategy, endowing it with the capability for transparent and logical reasoning over event sequences to produce coherent threat assessment reports. Extensive experiments on benchmark datasets, including XD-Violence and UCF-Crime, demonstrate that Live-E2T significantly outperforms state-of-the-art methods in terms of threat detection accuracy, real-time efficiency, and the crucial dimension of explainability.
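The abstract gives enough structure to sketch the second mechanism in code. Below is a minimal, hypothetical sketch of how online deduplication over Human-Object-Interaction-Place tuples might be organized; the names (HOIPEvent, EventBuffer, the ttl window) are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

# Hypothetical tuple schema; the paper's exact fields are not specified here.
@dataclass(frozen=True)
class HOIPEvent:
    human: str        # e.g. "person_3"
    obj: str          # e.g. "knife"
    interaction: str  # e.g. "holding"
    place: str        # e.g. "platform_entrance"

@dataclass
class TrackedEvent:
    event: HOIPEvent
    first_seen: float  # seconds since stream start
    last_seen: float

class EventBuffer:
    """Online deduplication: an identical tuple seen again within `ttl`
    seconds refreshes the existing entry instead of emitting a new event."""

    def __init__(self, ttl: float = 10.0):
        self.ttl = ttl
        self.active: dict[HOIPEvent, TrackedEvent] = {}

    def update(self, frame_events: list[HOIPEvent], t: float) -> list[TrackedEvent]:
        """Ingest the tuples of the frame at time `t`; return only novel events."""
        # Expire stale entries so the buffer stays bounded.
        self.active = {k: v for k, v in self.active.items()
                       if t - v.last_seen <= self.ttl}
        novel = []
        for ev in frame_events:
            if ev in self.active:
                self.active[ev].last_seen = t      # spatio-temporal duplicate: refresh only
            else:
                tracked = TrackedEvent(ev, first_seen=t, last_seen=t)
                self.active[ev] = tracked
                novel.append(tracked)              # novel event: pass downstream
        return novel

# Illustrative usage: the same tuple reported on consecutive frames is emitted once.
buffer = EventBuffer(ttl=10.0)
frame = [HOIPEvent("person_3", "knife", "holding", "platform_entrance")]
print(buffer.update(frame, t=12.4))   # emitted
print(buffer.update(frame, t=13.1))   # [] -- deduplicated
```

Under this reading, only the deduplicated event log, rather than every per-frame tuple, would be handed to the Chain-of-Thought fine-tuned LLM, which is what keeps the reasoning stage compatible with real-time constraints.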
Related papers
- Cascading multi-agent anomaly detection in surveillance systems via vision-language models and embedding-based classification [0.0]
This work introduces a cascading multi-agent framework that unifies complementary paradigms into a coherent and interpretable architecture.
Early modules perform reconstruction-gated filtering and object-level assessment, while higher-level reasoning agents are selectively invoked to interpret semantically ambiguous events.
The framework advances beyond conventional detection pipelines by combining early-exit efficiency, adaptive multi-agent reasoning, and explainable anomaly attribution, establishing a reproducible and energy-efficient foundation for scalable intelligent visual monitoring.
arXiv Detail & Related papers (2026-01-08T11:31:47Z)
- Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence.
We propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs.
Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z)
- Stable Language Guidance for Vision-Language-Action Models [62.80963701282789]
Residual Semantic Steering (RSS) is a probabilistic framework that disentangles physical affordance from semantic execution.
RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
arXiv Detail & Related papers (2026-01-07T16:16:10Z)
- T2VAttack: Adversarial Attack on Text-to-Video Diffusion Models [67.13397169618624]
We introduce T2VAttack, a study of adversarial attacks on Text-to-Video (T2V) models from both semantic and temporal perspectives.
To achieve an effective and efficient attack process, we propose two adversarial attack methods: T2VAttack-S, which identifies semantically or temporally critical words in prompts and replaces them with synonyms via greedy search, and T2VAttack-I, which iteratively inserts optimized words with minimal perturbation to the prompt.
arXiv Detail & Related papers (2025-12-30T03:00:46Z)
- DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models [55.30555646945055]
Text-to-Image (T2I) models are vulnerable to semantic leakage.
We introduce DeLeaker, a lightweight approach that mitigates leakage by directly intervening on the model's attention maps.
SLIM is the first dataset dedicated to semantic leakage.
arXiv Detail & Related papers (2025-10-16T17:39:21Z)
- FlowXpert: Context-Aware Flow Embedding for Enhanced Traffic Detection in IoT Network [7.30584204219718]
In the Internet of Things (IoT) environment, continuous interaction among a large number of devices generates complex and dynamic network traffic.
Machine learning (ML)-based traffic detection technology serves as a critical component in ensuring network security.
arXiv Detail & Related papers (2025-09-25T07:52:58Z)
- Dynamic Temporal Positional Encodings for Early Intrusion Detection in IoT [3.6686692131754834]
The rapid expansion of the Internet of Things (IoT) has introduced significant security challenges.
Traditional Intrusion Detection Systems (IDS) often overlook the temporal characteristics of network traffic.
We propose a Transformer-based Early Intrusion Detection System (EIDS) that incorporates dynamic temporal positional encodings.
arXiv Detail & Related papers (2025-06-22T17:56:19Z)
- T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models [88.63040835652902]
Text-to-video models are vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content.
We propose T2VShield, a comprehensive and model-agnostic defense framework designed to protect text-to-video models from jailbreak threats.
Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses.
arXiv Detail & Related papers (2025-04-22T01:18:42Z)
- Towards Effective, Efficient and Unsupervised Social Event Detection in the Hyperbolic Space [54.936897625837474]
This work introduces an unsupervised framework, HyperSED (Hyperbolic SED).
Specifically, the framework first models social messages into semantic-based message anchors, and then leverages the structure of the anchor graph.
Experiments on public datasets demonstrate HyperSED's competitive performance, along with a substantial improvement in efficiency.
arXiv Detail & Related papers (2024-12-14T06:55:27Z)
- Context-Conditioned Spatio-Temporal Predictive Learning for Reliable V2V Channel Prediction [25.688521281119037]
Vehicle-to-Vehicle (V2V) channel state information (CSI) prediction is challenging and crucial for optimizing downstream tasks.
Traditional prediction approaches focus on four-dimensional (4D) CSI, which includes predictions over time, bandwidth, and antenna (TX and RX) space.
We propose a novel context-conditioned spatio-temporal predictive learning method to capture dependencies within 4D CSI data.
arXiv Detail & Related papers (2024-09-16T04:15:36Z)
- SAFE-SIM: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries [94.84458417662407]
We introduce SAFE-SIM, a controllable closed-loop safety-critical simulation framework.
Our approach yields two distinct advantages: 1) generating realistic long-tail safety-critical scenarios that closely reflect real-world conditions, and 2) providing controllable adversarial behavior for more comprehensive and interactive evaluations.
We validate our framework empirically using the nuScenes and nuPlan datasets across multiple planners, demonstrating improvements in both realism and controllability.
arXiv Detail & Related papers (2023-12-31T04:14:43Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.