Time and Frequency Network for Human Action Detection in Videos
- URL: http://arxiv.org/abs/2103.04680v1
- Date: Mon, 8 Mar 2021 11:42:05 GMT
- Title: Time and Frequency Network for Human Action Detection in Videos
- Authors: Changhai Li, Huawei Chen, Jingqing Lu, Yang Huang and Yingying Liu
- Abstract summary: We propose an end-to-end network, named TFNet, that considers time and frequency features simultaneously.
To obtain the action patterns, these two features are deeply fused under an attention mechanism.
- Score: 6.78349879472022
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Currently, spatiotemporal features are embraced by most deep learning
approaches for human action detection in videos; however, these approaches
neglect important features in the frequency domain. In this work, we propose an
end-to-end network, named TFNet, that considers time and frequency features
simultaneously. TFNet holds two branches: the time branch, a three-dimensional
convolutional neural network (3D-CNN) that takes the image sequence as input to
extract time features, and the frequency branch, a two-dimensional
convolutional neural network (2D-CNN) that extracts frequency features from DCT
coefficients. Finally, to obtain the action patterns, the two features are
deeply fused under an attention mechanism. Experimental results on the
JHMDB51-21 and UCF101-24 datasets demonstrate that our approach achieves
remarkable frame-mAP performance.
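As a rough illustration of the two-branch design described above, the following PyTorch sketch pairs a 3D-CNN time branch with a 2D-CNN frequency branch over DCT coefficients and fuses them with a learned attention weighting. All layer sizes, the DCT-coefficient layout (stacked into 64 channels), and the fusion head are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal PyTorch sketch of the two-branch idea; sizes and fusion are assumed.
import torch
import torch.nn as nn

class TFNetSketch(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Time branch: a tiny 3D-CNN over the raw frame sequence.
        self.time_branch = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        # Frequency branch: a 2D-CNN over per-frame DCT coefficient maps,
        # assumed here to be stacked into 64 input channels.
        self.freq_branch = nn.Sequential(
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        # Learned attention weights over the two feature vectors.
        self.attn = nn.Sequential(nn.Linear(2 * feat_dim, 2), nn.Softmax(dim=-1))

    def forward(self, frames, dct_coeffs):
        # frames: (B, 3, T, H, W); dct_coeffs: (B, 64, H', W')
        t = self.time_branch(frames)
        f = self.freq_branch(dct_coeffs)
        w = self.attn(torch.cat([t, f], dim=-1))   # (B, 2) branch weights
        return w[:, :1] * t + w[:, 1:] * f         # fused action feature
```

The attention weights let the network decide, per clip, how much the temporal and frequency views each contribute to the fused action feature.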
Related papers
- 2D bidirectional gated recurrent unit convolutional neural networks for end-to-end violence detection in videos [0.0]
We propose an architecture that combines a Bidirectional Gated Recurrent Unit (BiGRU) and a 2D Convolutional Neural Network (CNN) to detect violence in video sequences.
A CNN is used to extract spatial characteristics from each frame, while the BiGRU extracts temporal and local motion characteristics using CNN extracted features from multiple frames.
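A minimal sketch of this CNN-plus-BiGRU pattern, with a per-frame 2D encoder feeding a bidirectional GRU; feature sizes and the classification head are assumptions:

```python
# Hedged sketch: a 2D CNN encodes each frame, a BiGRU models the sequence.
import torch
import torch.nn as nn

class CnnBiGru(nn.Module):
    def __init__(self, feat_dim=128, hidden=64, num_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(  # per-frame spatial encoder
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.bigru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)  # violence / no violence

    def forward(self, clip):                 # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)  # (B, T, feat_dim)
        seq, _ = self.bigru(feats)                           # (B, T, 2*hidden)
        return self.head(seq[:, -1])                         # clip-level logits
```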
arXiv Detail & Related papers (2024-09-11T19:36:12Z)
- EPAM-Net: An Efficient Pose-driven Attention-guided Multimodal Network for Video Action Recognition [0.0]
We present an efficient pose-driven attention-guided multimodal action recognition (EPAM-Net) for action recognition in videos.
Specifically, we adapt X3D networks for both RGB and pose streams to capture spatio-temporal features from RGB videos and their skeleton sequences.
Our model provides a 6.2-9.9x reduction in FLOPs (floating-point operations, measured in multiply-adds) and a 9-9.6x reduction in the number of network parameters.
arXiv Detail & Related papers (2024-08-10T03:15:24Z)
- Time-space-frequency feature Fusion for 3-channel motor imagery classification [0.0]
This study introduces TSFF-Net, a novel network architecture that integrates time-space-frequency features.
TSFF-Net comprises four main components: time-frequency representation, time-frequency feature extraction, time-space feature extraction, and feature fusion and classification.
Experimental results demonstrate that TSFF-Net not only compensates for the shortcomings of single-mode feature extraction networks in EEG decoding, but also outperforms other state-of-the-art methods.
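The four-stage pipeline named above can be sketched roughly as follows; the spectrogram front-end, branch layers, and fusion are all assumed stand-ins, since the summary does not give TSFF-Net's exact layers:

```python
# Illustrative sketch: time-frequency branch (STFT + 2D conv) alongside a
# time-space branch (1D conv on the raw 3-channel EEG), fused by concatenation.
import torch
import torch.nn as nn

class TsffSketch(nn.Module):
    def __init__(self, n_ch=3, n_fft=64, num_classes=2):
        super().__init__()
        self.n_fft = n_fft
        self.tf_branch = nn.Sequential(     # 2D conv over the spectrogram
            nn.Conv2d(n_ch, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.ts_branch = nn.Sequential(     # 1D conv over the raw signal
            nn.Conv1d(n_ch, 16, 7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.head = nn.Linear(32, num_classes)

    def forward(self, eeg):                 # eeg: (B, 3, T)
        b, c, t = eeg.shape
        spec = torch.stft(eeg.reshape(b * c, t), n_fft=self.n_fft,
                          window=torch.hann_window(self.n_fft),
                          return_complex=True).abs()          # (B*C, F, T')
        spec = spec.view(b, c, *spec.shape[-2:])
        fused = torch.cat([self.tf_branch(spec), self.ts_branch(eeg)], dim=-1)
        return self.head(fused)
```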
arXiv Detail & Related papers (2023-04-04T02:01:48Z)
- NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction [79.13750275141139]
This paper proposes a novel and fast self-supervised solution for sparse-view CBCT reconstruction.
The desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network.
A learning-based encoder entailing hash coding is adopted to help the network capture high-frequency details.
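A hedged sketch of the core idea: an MLP maps a 3D point to a non-negative attenuation coefficient, fed by a hashed feature lookup. A single-level, nearest-vertex hash grid stands in here for the multi-resolution hash encoder, and all sizes are assumptions:

```python
# Minimal neural attenuation field: hashed grid features -> MLP -> attenuation.
import torch
import torch.nn as nn

class NafSketch(nn.Module):
    def __init__(self, table_size=2**16, feat_dim=8, res=64):
        super().__init__()
        self.res = res
        self.table = nn.Embedding(table_size, feat_dim)   # hashed feature table
        # Per-axis primes for a simple spatial hash of grid-vertex indices.
        self.register_buffer("primes", torch.tensor([1, 2654435761, 805459861]))
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Softplus())              # attenuation >= 0

    def forward(self, xyz):                 # xyz in [0, 1]^3, shape (N, 3)
        idx = (xyz * (self.res - 1)).long() # nearest grid vertex (no interp.)
        h = (idx * self.primes).sum(-1) % self.table.num_embeddings
        return self.mlp(self.table(h))      # (N, 1) attenuation coefficients
```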
arXiv Detail & Related papers (2022-09-29T04:06:00Z)
- Spatial-Temporal Frequency Forgery Clue for Video Forgery Detection in VIS and NIR Scenario [87.72258480670627]
Existing face forgery detection methods based on the frequency domain find that GAN-forged images have obvious grid-like visual artifacts in the frequency spectrum compared to real images.
This paper proposes a Discrete Cosine Transform-based Forgery Clue Augmentation Network (FCAN-DCT) to achieve a more comprehensive spatial-temporal feature representation.
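For methods that work on DCT coefficients (FCAN-DCT here, and TFNet above), the frequency view is typically a blockwise 2D DCT of each frame. A small illustrative helper, with the 8x8 block size as an assumption:

```python
# Illustrative blockwise 2D DCT of a grayscale frame using scipy.
import numpy as np
from scipy.fft import dctn

def blockwise_dct(gray, block=8):
    # gray: (H, W) array with H, W divisible by `block`
    h, w = gray.shape
    blocks = gray.reshape(h // block, block, w // block, block).swapaxes(1, 2)
    return dctn(blocks, axes=(-2, -1), norm='ortho')  # per-block DCT coeffs
```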
arXiv Detail & Related papers (2022-07-05T09:27:53Z)
- STSM: Spatio-Temporal Shift Module for Efficient Action Recognition [4.096670184726871]
We propose a plug-and-play Spatio-temporal Shift Module (STSM) that is both effective and high-performance.
In particular, when the network is a 2D CNN, our STSM module allows the network to learn efficient spatio-temporal features.
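The shift idea can be sketched as follows: a fraction of the channels is displaced along the temporal axis (STSM also shifts spatially), so a plain 2D convolution afterwards mixes information across frames at zero extra FLOPs. The shifted fraction is an assumption:

```python
# Hedged sketch of a temporal channel shift (the spatial shift is analogous).
import torch

def temporal_shift(x, fold_div=8):
    # x: (B, T, C, H, W); shift C//fold_div channels each way in time.
    fold = x.size(2) // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # rest untouched
    return out
```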
arXiv Detail & Related papers (2021-12-05T09:40:49Z)
- Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block that is capable of extracting features at multiple temporal resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
arXiv Detail & Related papers (2020-11-08T10:40:26Z)
- A Prospective Study on Sequence-Driven Temporal Sampling and Ego-Motion Compensation for Action Recognition in the EPIC-Kitchens Dataset [68.8204255655161]
Action recognition is one of the most challenging research fields in computer vision.
Sequences recorded under ego-motion have become especially relevant.
The proposed method aims to cope with this by estimating the ego-motion, or camera motion.
arXiv Detail & Related papers (2020-08-26T14:44:45Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
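A sketch of the 3D central difference convolution referenced above: a vanilla 3D convolution minus theta times the kernel-weight sum applied to the center value, which emphasizes local spatio-temporal gradients. The theta value is the commonly used default, assumed here:

```python
# Hedged sketch of a 3D central difference convolution (3D-CDC).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDC3d(nn.Module):
    def __init__(self, c_in, c_out, theta=0.7):
        super().__init__()
        self.conv = nn.Conv3d(c_in, c_out, kernel_size=3, padding=1, bias=False)
        self.theta = theta

    def forward(self, x):                       # x: (B, C, T, H, W)
        vanilla = self.conv(x)
        # 1x1x1 convolution with the summed kernel weights = w_sum * x(center)
        w_sum = self.conv.weight.sum(dim=(2, 3, 4), keepdim=True)
        center = F.conv3d(x, w_sum)
        return vanilla - self.theta * center
```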
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
- Temporal Pyramid Network for Action Recognition [129.12076009042622]
We propose a generic Temporal Pyramid Network (TPN) at the feature-level, which can be flexibly integrated into 2D or 3D backbone networks.
TPN shows consistent improvements over other challenging baselines on several action recognition datasets.
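A feature-level temporal pyramid can be sketched as pooling the same feature map at several temporal rates and fusing the results, so both fast and slow action tempos contribute; the rates and the sum fusion are assumptions, not TPN's exact design:

```python
# Hedged sketch of a feature-level temporal pyramid.
import torch
import torch.nn.functional as F

def temporal_pyramid(feat, rates=(1, 2, 4)):
    # feat: (B, C, T, H, W); pool at several temporal rates, then fuse.
    levels = []
    for r in rates:
        pooled = F.avg_pool3d(feat, kernel_size=(r, 1, 1), stride=(r, 1, 1))
        levels.append(F.interpolate(pooled, size=feat.shape[2:], mode='nearest'))
    return torch.stack(levels).sum(0)           # fused multi-rate features
```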
arXiv Detail & Related papers (2020-04-07T17:17:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.