KORSAL: Key-point Detection based Online Real-Time Spatio-Temporal
Action Localization
- URL: http://arxiv.org/abs/2111.03319v1
- Date: Fri, 5 Nov 2021 08:39:36 GMT
- Title: KORSAL: Key-point Detection based Online Real-Time Spatio-Temporal
Action Localization
- Authors: Kalana Abeywardena, Shechem Sumanthiran, Sakuna Jayasundara, Sachira
Karunasena, Ranga Rodrigo, Peshala Jayasekara
- Abstract summary: Real-time and online action localization in a video is a critical yet highly challenging problem.
Recent attempts achieve this by using computationally intensive 3D CNN architectures or highly redundant two-stream architectures with optical flow.
We propose utilizing fast and efficient key-point based bounding box prediction to spatially localize actions.
Our model achieves a frame rate of 41.8 FPS, which is a 10.7% improvement over contemporary real-time methods.
- Score: 0.9507070656654633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-time and online action localization in a video is a critical yet highly
challenging problem. Accurate action localization requires the utilization of
both temporal and spatial information. Recent attempts achieve this by using
computationally intensive 3D CNN architectures or highly redundant two-stream
architectures with optical flow, making them both unsuitable for real-time,
online applications. To accomplish activity localization under highly
challenging real-time constraints, we propose utilizing fast and efficient
key-point based bounding box prediction to spatially localize actions. We then
introduce a tube-linking algorithm that maintains the continuity of action
tubes temporally in the presence of occlusions. Further, we eliminate the need
for a two-stream architecture by combining temporal and spatial information
into a cascaded input to a single network, allowing the network to learn from
both types of information. Temporal information is efficiently extracted using
a structural similarity index map as opposed to computationally intensive
optical flow. Despite the simplicity of our approach, our lightweight
end-to-end architecture achieves state-of-the-art frame-mAP of 74.7% on the
challenging UCF101-24 dataset, demonstrating a performance gain of 6.4% over
the previous best online methods. We also achieve state-of-the-art video-mAP
results compared to both online and offline methods. Moreover, our model
achieves a frame rate of 41.8 FPS, which is a 10.7% improvement over
contemporary real-time methods.
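The abstract mentions a tube-linking algorithm that keeps action tubes continuous through occlusions but gives no details. The following is only a minimal sketch of one common way to do this, not the authors' algorithm: greedy IoU matching of per-frame detections to live tubes, with a tube surviving a few frames without a match. The names `Tube`, `link_frame`, `iou_thresh`, and `max_missed` are hypothetical and the default values are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[int, float, Box]       # (class label, score, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Tube:
    label: int
    boxes: List[Tuple[int, Box]] = field(default_factory=list)  # (frame, box)
    missed: int = 0  # consecutive frames without a matching detection


def link_frame(tubes: List[Tube], detections: List[Detection], frame_idx: int,
               iou_thresh: float = 0.3, max_missed: int = 5) -> List[Tube]:
    """Extend live tubes with this frame's detections (hypothetical sketch).

    A tube with no match is kept for up to `max_missed` frames, which is
    the assumed mechanism for bridging short occlusions.
    """
    # Consider high-scoring detections first when several overlap equally.
    unused = sorted(detections, key=lambda d: d[1], reverse=True)
    for tube in tubes:
        last_box = tube.boxes[-1][1]
        best, best_iou = None, iou_thresh
        for det in unused:
            if det[0] != tube.label:
                continue  # only link detections of the same action class
            overlap = iou(last_box, det[2])
            if overlap >= best_iou:
                best, best_iou = det, overlap
        if best is not None:
            tube.boxes.append((frame_idx, best[2]))
            tube.missed = 0
            unused.remove(best)
        else:
            tube.missed += 1
    # Drop tubes that have been unmatched for too long.
    alive = [t for t in tubes if t.missed <= max_missed]
    # Every unmatched detection starts a new tube.
    alive.extend(Tube(label=lbl, boxes=[(frame_idx, box)]) for lbl, _, box in unused)
    return alive
```

Calling `link_frame` once per incoming frame keeps the procedure online: only the current detections and the live tubes are needed, never future frames.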
Related papers
- STLight: a Fully Convolutional Approach for Efficient Predictive Learning by Spatio-Temporal joint Processing [6.872340834265972]
We propose STLight, a novel method for spatio-temporal learning that relies solely on channel-wise and depth-wise convolutions as learnable layers.
STLight overcomes the limitations of traditional convolutional approaches by rearranging spatial and temporal dimensions together.
Our architecture achieves state-of-the-art performance on STL benchmarks across datasets and settings, while significantly improving computational efficiency in terms of parameters and computational FLOPs.
arXiv Detail & Related papers (2024-11-15T13:53:19Z)
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z)
- Fast Neural Scene Flow [36.29234109363439]
A coordinate neural network estimates scene flow at runtime, without any training.
In this paper, we demonstrate that scene flow is different -- with the dominant computational bottleneck stemming from the loss function itself.
Our fast neural scene flow (FNSF) approach reports for the first time real-time performance comparable to learning methods.
arXiv Detail & Related papers (2023-04-18T16:37:18Z)
- Multiple Object Tracking with Correlation Learning [16.959379957515974]
We propose to exploit the local correlation module to model the topological relationship between targets and their surrounding environment.
Specifically, we establish dense correspondences of each spatial location and its context, and explicitly constrain the correlation volumes through self-supervised learning.
Our approach demonstrates the effectiveness of correlation learning with the superior performance and obtains state-of-the-art MOTA of 76.5% and IDF1 of 73.6% on MOT17.
arXiv Detail & Related papers (2021-04-08T06:48:02Z)
- Reinforcement Learning with Latent Flow [78.74671595139613]
Flow of Latents for Reinforcement Learning (Flare) is a network architecture for RL that explicitly encodes temporal information through latent vector differences.
We show that Flare recovers optimal performance in state-based RL without explicit access to the state velocity.
We also show that Flare achieves state-of-the-art performance on pixel-based challenging continuous control tasks within the DeepMind control benchmark suite.
arXiv Detail & Related papers (2021-01-06T03:50:50Z)
- Finding Action Tubes with a Sparse-to-Dense Framework [62.60742627484788]
We propose a framework that generates action tube proposals from video streams with a single forward pass in a sparse-to-dense manner.
We evaluate the efficacy of our model on the UCF101-24, JHMDB-21 and UCFSports benchmark datasets.
arXiv Detail & Related papers (2020-08-30T15:38:44Z)
- Exploring Rich and Efficient Spatial Temporal Interactions for Real Time Video Salient Object Detection [87.32774157186412]
Mainstream methods formulate video saliency mainly from two independent sources, i.e., the spatial and temporal branches.
In this paper, we propose a spatiotemporal network to achieve such improvement in a fully interactive fashion.
Our method is easy to implement yet effective, achieving high quality video saliency detection in real-time speed with 50 FPS.
arXiv Detail & Related papers (2020-08-07T03:24:04Z)
- Real-time Semantic Segmentation with Fast Attention [94.88466483540692]
We propose a novel architecture for semantic segmentation of high-resolution images and videos in real-time.
The proposed architecture relies on our fast spatial attention, which is a simple yet efficient modification of the popular self-attention mechanism.
Results on multiple datasets demonstrate better accuracy and speed than existing approaches.
arXiv Detail & Related papers (2020-07-07T22:37:16Z)
- Real-Time High-Performance Semantic Image Segmentation of Urban Street Scenes [98.65457534223539]
We propose a real-time high-performance DCNN-based method for robust semantic segmentation of urban street scenes.
The proposed method achieves the accuracy of 73.6% and 68.0% mean Intersection over Union (mIoU) with the inference speed of 51.0 fps and 39.3 fps.
arXiv Detail & Related papers (2020-03-11T08:45:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.