Related papers: A Tactical Behaviour Recognition Framework Based on Causal Multimodal Reasoning: A Study on Covert Audio-Video Analysis Combining GAN Structure Enhancement and Phonetic Accent Modelling

A Tactical Behaviour Recognition Framework Based on Causal Multimodal Reasoning: A Study on Covert Audio-Video Analysis Combining GAN Structure Enhancement and Phonetic Accent Modelling

URL: http://arxiv.org/abs/2507.21100v1
Date: Fri, 04 Jul 2025 15:43:43 GMT
Title: A Tactical Behaviour Recognition Framework Based on Causal Multimodal Reasoning: A Study on Covert Audio-Video Analysis Combining GAN Structure Enhancement and Phonetic Accent Modelling
Authors: Wei Meng,
Abstract summary: TACTIC-GRAPHS is a system that combines spectral graph theory and multimodal graph neural reasoning for semantic understanding and threat detection in tactical video.<n>The framework incorporates spectral embedding, temporal causal edge modeling, and discriminative path inference across heterogeneous modalities.<n> Experiments on TACTIC-AVS and TACTIC-Voice datasets show 89.3 percent accuracy in temporal alignment and over 85 percent recognition of complete threat chains, with node latency within plus-minus 150 milliseconds.
Score: 3.5516803380598074
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper introduces TACTIC-GRAPHS, a system that combines spectral graph theory and multimodal graph neural reasoning for semantic understanding and threat detection in tactical video under high noise and weak structure. The framework incorporates spectral embedding, temporal causal edge modeling, and discriminative path inference across heterogeneous modalities. A semantic-aware keyframe extraction method fuses visual, acoustic, and action cues to construct temporal graphs. Using graph attention and Laplacian spectral mapping, the model performs cross-modal weighting and causal signal analysis. Experiments on TACTIC-AVS and TACTIC-Voice datasets show 89.3 percent accuracy in temporal alignment and over 85 percent recognition of complete threat chains, with node latency within plus-minus 150 milliseconds. The approach enhances structural interpretability and supports applications in surveillance, defense, and intelligent security systems.

Related papers

DAMS:Dual-Branch Adaptive Multiscale Spatiotemporal Framework for Video Anomaly Detection [7.117824587276951]
This study offers a dual-path architecture called the Dual-Branch Adaptive Multiscale Stemporal Framework (DAMS), which is based on multilevel feature and decoupling fusion.<n>The main processing path integrates the Adaptive Multiscale Time Pyramid Network (AMTPN) with the Convolutional Block Attention Mechanism (CBAM)
arXiv Detail & Related papers (2025-07-28T08:42:00Z)
A study on audio synchronous steganography detection and distributed guide inference model based on sliding spectral features and intelligent inference drive [3.5516803380598074]
This paper proposes a detection and distributed guidance reconstruction model based on short video "Yupan" samples released by China's South Sea Fleet on TikTok.<n>The proposed framework validates the effectiveness of sliding spectral features for synchronized steganography detection and builds an inference model for covert communication analysis and tactical guidance simulation on open platforms.
arXiv Detail & Related papers (2025-05-06T05:24:11Z)
CARL: Camera-Agnostic Representation Learning for Spectral Image Analysis [75.25966323298003]
Spectral imaging offers promising applications across diverse domains, including medicine and urban scene understanding.<n> variability in channel dimensionality and captured wavelengths among spectral cameras impede the development of AI-driven methodologies.<n>We introduce $textbfCARL$, a model for $textbfC$amera-$textbfA$gnostic $textbfR$esupervised $textbfL$ across RGB, multispectral, and hyperspectral imaging modalities.
arXiv Detail & Related papers (2025-04-27T13:06:40Z)
AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection.<n>Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames. While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input. We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
Neural Harmonium: An Interpretable Deep Structure for Nonlinear Dynamic System Identification with Application to Audio Processing [4.599180419117645]
Interpretability helps us understand a model's ability to generalize and reveal its limitations. We introduce a causal interpretable deep structure for modeling dynamic systems. Our proposed model makes use of the harmonic analysis by modeling the system in a time-frequency domain.
arXiv Detail & Related papers (2023-10-10T21:32:15Z)
High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations. Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
Synergistic Signal Denoising for Multimodal Time Series of Structure Vibration [9.144905626316534]
This paper introduces a novel deep learning algorithm tailored for the complexities inherent in multimodal vibration signals prevalent in Structural Health Monitoring (SHM) By amalgamating convolutional and recurrent architectures, the algorithm adeptly captures both localized and prolonged structural behaviors. Our results showcase significant improvements in predictive accuracy, early damage detection, and adaptability across multiple SHM scenarios.
arXiv Detail & Related papers (2023-08-17T00:41:50Z)
PAIF: Perception-Aware Infrared-Visible Image Fusion for Attack-Tolerant Semantic Segmentation [50.556961575275345]
We propose a perception-aware fusion framework to promote segmentation robustness in adversarial scenes. We show that our scheme substantially enhances the robustness, with gains of 15.3% mIOU, compared with advanced competitors.
arXiv Detail & Related papers (2023-08-08T01:55:44Z)
Deep Spectro-temporal Artifacts for Detecting Synthesized Speech [57.42110898920759]
This paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection) In this paper, spectro-temporal artifacts were detected using raw temporal signals, spectral features, as well as deep embedding features. We ranked 4th and 5th in track 1 and track 2, respectively.
arXiv Detail & Related papers (2022-10-11T08:31:30Z)
AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks [45.2410605401286]
We seek to develop an efficient, single system that can detect a broad range of different spoofing attacks without score-level ensembles. We propose a novel heterogeneous stacking graph attention layer which models artefacts spanning heterogeneous temporal and spectral domains. Our approach, named AASIST, outperforms the current state-of-the-art by 20% relative.
arXiv Detail & Related papers (2021-10-04T05:48:25Z)
Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation. CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from human body. It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.