Efficient Spatial-Temporal Modeling for Real-Time Video Analysis: A Unified Framework for Action Recognition and Object Tracking
- URL: http://arxiv.org/abs/2507.22421v1
- Date: Wed, 30 Jul 2025 06:49:11 GMT
- Title: Efficient Spatial-Temporal Modeling for Real-Time Video Analysis: A Unified Framework for Action Recognition and Object Tracking
- Authors: Shahla John
- Abstract summary: Real-time video analysis remains a challenging problem in computer vision. We present a unified framework that leverages advanced spatial-temporal modeling techniques for simultaneous action recognition and object tracking. Our method achieves state-of-the-art performance on standard benchmarks while maintaining real-time inference speeds.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Real-time video analysis remains a challenging problem in computer vision, requiring efficient processing of both spatial and temporal information while maintaining computational efficiency. Existing approaches often struggle to balance accuracy and speed, particularly in resource-constrained environments. In this work, we present a unified framework that leverages advanced spatial-temporal modeling techniques for simultaneous action recognition and object tracking. Our approach builds upon recent advances in parallel sequence modeling and introduces a novel hierarchical attention mechanism that adaptively focuses on relevant spatial regions across temporal sequences. We demonstrate that our method achieves state-of-the-art performance on standard benchmarks while maintaining real-time inference speeds. Extensive experiments on UCF-101, HMDB-51, and MOT17 datasets show improvements of 3.2% in action recognition accuracy and 2.8% in tracking precision compared to existing methods, with 40% faster inference time.
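The listing gives only the abstract, so the hierarchical attention mechanism is not specified further. Purely as an illustration of the general idea, a minimal PyTorch sketch of a two-level block that attends over spatial regions within each frame and then over frames across time might look as follows; the module name, mean pooling, head count, and tensor shapes are all assumptions, not details from the paper:

```python
import torch
import torch.nn as nn

class HierarchicalSTAttention(nn.Module):
    """Two-level attention sketch: attend over patches within each frame,
    then over pooled frame descriptors across time."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, patches, dim), patch tokens for each frame
        b, t, p, d = x.shape
        s = x.reshape(b * t, p, d)
        s, _ = self.spatial_attn(s, s, s)        # focus on spatial regions per frame
        frames = s.mean(dim=1).reshape(b, t, d)  # pool patches into one token per frame
        frames, _ = self.temporal_attn(frames, frames, frames)  # relate frames across time
        return frames                            # (batch, time, dim)

# Hypothetical usage: a 2-clip batch of 8 frames, 196 patch tokens, width 256.
clip = torch.randn(2, 8, 196, 256)
print(HierarchicalSTAttention(256)(clip).shape)  # torch.Size([2, 8, 256])
```

Mean pooling is only the simplest choice for the hierarchy step; the abstract's "adaptively focuses" suggests a learned weighting instead, but the listing does not say which.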
Related papers
- Occupancy Learning with Spatiotemporal Memory [39.41175479685905]
We propose a scene-level occupancy representation learning framework that effectively learns 3D occupancy features with temporal consistency. Our method significantly enhances the spatiotemporal representation learned for 3D occupancy prediction tasks by exploiting the temporal dependency between multi-frame inputs.
arXiv Detail & Related papers (2025-08-06T17:59:52Z)
- CAST: Cross-Attentive Spatio-Temporal feature fusion for Deepfake detection [0.0]
CNNs are effective at capturing spatial artifacts, while Transformers excel at modeling temporal inconsistencies. We propose a unified CAST model that leverages cross-attention to effectively fuse spatial and temporal features. We evaluate the performance of our model on the FaceForensics++, Celeb-DF, and DeepfakeDetection datasets.
arXiv Detail & Related papers (2025-06-26T18:51:17Z)
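The summary above describes cross-attention that fuses CNN spatial features with Transformer temporal features. The paper's actual CAST architecture is not detailed in this listing; the snippet below is only a generic sketch of bidirectional cross-attention fusion, with the class name, head count, pooling, and shapes all assumed for illustration:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Generic bidirectional cross-attention: each stream queries the other."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.s2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, spatial: torch.Tensor, temporal: torch.Tensor) -> torch.Tensor:
        # spatial:  (batch, n_s, dim) tokens from a CNN feature map
        # temporal: (batch, n_t, dim) tokens from a temporal Transformer
        s_enriched, _ = self.s2t(spatial, temporal, temporal)  # spatial queries temporal
        t_enriched, _ = self.t2s(temporal, spatial, spatial)   # temporal queries spatial
        # concatenate pooled descriptors for a downstream classifier head
        return torch.cat([s_enriched.mean(1), t_enriched.mean(1)], dim=-1)

# Hypothetical usage: 49 spatial tokens and 16 temporal tokens of width 128.
fused = CrossAttentionFusion(128)(torch.randn(2, 49, 128), torch.randn(2, 16, 128))
print(fused.shape)  # torch.Size([2, 256])
```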
- Mitigating Trade-off: Stream and Query-guided Aggregation for Efficient and Effective 3D Occupancy Prediction [12.064509280163502]
3D occupancy prediction has emerged as a key perception task for autonomous driving. Recent studies focus on integrating information obtained from past observations to improve prediction accuracy. We propose StreamOcc, a framework that aggregates past spatio-temporal information in a stream-based manner. Experiments on the Occ3D-nuScenes dataset show that StreamOcc achieves state-of-the-art performance in real-time settings, while reducing memory usage by more than 50% compared to previous methods.
arXiv Detail & Related papers (2025-03-28T02:05:53Z)
- CAST: Cross-Attention in Space and Time for Video Action Recognition [8.785207228156098]
We propose a novel two-stream architecture called Cross-Attention in Space and Time (CAST)
CAST achieves a balanced spatial-temporal understanding of videos using only RGB input.
Our proposed mechanism enables spatial and temporal expert models to exchange information and make synergistic predictions.
arXiv Detail & Related papers (2023-11-30T18:58:51Z)
- Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
arXiv Detail & Related papers (2023-03-14T02:58:27Z)
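The entry above describes feeding only the current frame into the network and letting it interact with multi-frame historical features stored in a memory bank. The sketch below shows one generic way such an interaction loop can be wired up; the class name, FIFO update rule, bank size, and shapes are assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

class MemoryBankTracker(nn.Module):
    """Per-tracklet memory sketch: the current frame's feature attends to a
    fixed-length bank of historical features, which is then updated FIFO-style."""
    def __init__(self, dim: int, bank_size: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.bank_size = bank_size

    def forward(self, current: torch.Tensor, bank: torch.Tensor):
        # current: (batch, 1, dim) feature of the current frame only
        # bank:    (batch, m, dim) stored multi-frame historical features
        fused, _ = self.attn(current, bank, bank)  # current frame queries history
        bank = torch.cat([bank, fused], dim=1)[:, -self.bank_size:]  # FIFO update
        return fused, bank

tracker = MemoryBankTracker(dim=64)
bank = torch.zeros(1, 8, 64)
for step in range(3):                  # one frame arrives per timestamp
    frame_feat = torch.randn(1, 1, 64)
    pred, bank = tracker(frame_feat, bank)
print(pred.shape, bank.shape)  # torch.Size([1, 1, 64]) torch.Size([1, 8, 64])
```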
- Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted the attention of the multimedia and computer vision communities.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z)
- KORSAL: Key-point Detection based Online Real-Time Spatio-Temporal Action Localization [0.9507070656654633]
Real-time and online action localization in a video is a critical yet highly challenging problem.
Recent attempts achieve this by using computationally intensive 3D CNN architectures or highly redundant two-stream architectures with optical flow.
We propose utilizing fast and efficient key-point based bounding box prediction to spatially localize actions.
Our model achieves a frame rate of 41.8 FPS, which is a 10.7% improvement over contemporary real-time methods.
arXiv Detail & Related papers (2021-11-05T08:39:36Z)
- Efficient Global-Local Memory for Real-time Instrument Segmentation of Robotic Surgical Video [53.14186293442669]
We identify two important clues for surgical instrument perception: local temporal dependency from adjacent frames and global semantic correlation over long durations.
We propose a novel dual-memory network (DMNet) to relate both global and local spatio-temporal knowledge.
Our method largely outperforms the state-of-the-art works on segmentation accuracy while maintaining a real-time speed.
arXiv Detail & Related papers (2021-09-28T10:10:14Z)
- Unsupervised Feature Learning for Event Data: Direct vs Inverse Problem Formulation [53.850686395708905]
Event-based cameras record an asynchronous stream of per-pixel brightness changes.
In this paper, we focus on single-layer architectures for representation learning from event data.
We show improvements of up to 9% in recognition accuracy compared to state-of-the-art methods.
arXiv Detail & Related papers (2020-09-23T10:40:03Z)
- Exploring Rich and Efficient Spatial Temporal Interactions for Real Time Video Salient Object Detection [87.32774157186412]
Mainstream methods formulate video saliency from two largely independent branches, i.e., the spatial and temporal branches.
In this paper, we propose a spatiotemporal network to achieve such improvement in a fully interactive fashion.
Our method is easy to implement yet effective, achieving high-quality video saliency detection in real time at 50 FPS.
arXiv Detail & Related papers (2020-08-07T03:24:04Z)
- Real-time Semantic Segmentation with Fast Attention [94.88466483540692]
We propose a novel architecture for semantic segmentation of high-resolution images and videos in real-time.
The proposed architecture relies on our fast spatial attention, which is a simple yet efficient modification of the popular self-attention mechanism.
Results on multiple datasets demonstrate superior performance, with better accuracy and speed than existing approaches.
arXiv Detail & Related papers (2020-07-07T22:37:16Z)
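The summary above calls fast attention a simple modification of self-attention but does not spell it out. A standard way to make attention linear in the number of positions, consistent with the fast-attention idea, is to L2-normalize queries and keys in place of the softmax so the matrix products can be reordered; the sketch below illustrates that reordering, with all shapes illustrative:

```python
import torch
import torch.nn.functional as F

def fast_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Linear-complexity attention sketch: L2-normalize q and k instead of
    applying softmax, then compute k^T v first so cost is O(n * d^2), not O(n^2 * d)."""
    n = q.shape[1]
    q = F.normalize(q, dim=-1)         # (batch, n, d)
    k = F.normalize(k, dim=-1)         # (batch, n, d)
    context = k.transpose(1, 2) @ v    # (batch, d, d), aggregated once over positions
    return (q @ context) / n           # (batch, n, d)

# For a 128x128 feature map flattened to n = 16384 tokens, the small (d, d)
# context matrix avoids materializing a 16384 x 16384 affinity map.
q = k = v = torch.randn(1, 16384, 32)
print(fast_attention(q, k, v).shape)  # torch.Size([1, 16384, 32])
```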