F4D: Factorized 4D Convolutional Neural Network for Efficient
Video-level Representation Learning
- URL: http://arxiv.org/abs/2401.08609v1
- Date: Tue, 28 Nov 2023 19:21:57 GMT
- Title: F4D: Factorized 4D Convolutional Neural Network for Efficient
Video-level Representation Learning
- Authors: Mohammad Al-Saad, Lakshmish Ramaswamy and Suchendra Bhandarkar
- Abstract summary: Most existing 3D convolutional neural network (CNN)-based methods for video-level representation learning are clip-based.
We propose a factorized 4D CNN architecture with attention (F4D) that is capable of learning more effective, finer-grained, long-term spatiotemporal video representations.
- Score: 4.123763595394021
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have shown that video-level representation learning is crucial
to the capture and understanding of the long-range temporal structure for video
action recognition. Most existing 3D convolutional neural network (CNN)-based
methods for video-level representation learning are clip-based and focus only
on short-term motion and appearances. These CNN-based methods lack the capacity
to incorporate and model the long-range spatiotemporal representation of the
underlying video and ignore the long-range video-level context during training.
In this study, we propose a factorized 4D CNN architecture with attention (F4D)
that is capable of learning more effective, finer-grained, long-term
spatiotemporal video representations. We demonstrate that the proposed F4D
architecture yields significant performance improvements over the conventional
2D and 3D CNN architectures proposed in the literature. Experimental
evaluation on five action recognition benchmark datasets, i.e.,
Something-Something-v1, Something-Something-v2, Kinetics-400, UCF101, and
HMDB51 demonstrates the
effectiveness of the proposed F4D network architecture for video-level action
recognition.
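The summary above carries no code; as a rough illustration of the general idea behind factorized 4D convolutions (not the authors' F4D implementation; the block structure and names below are assumptions, and the attention component is omitted), a 4D convolution over (clip, time, height, width) can be split into a per-clip 3D convolution for short-term features plus a 1D convolution across clips for video-level context:

```python
import torch
import torch.nn as nn

class Factorized4DBlock(nn.Module):
    """Toy factorized 4D convolution: a 3D conv inside each short clip
    (short-term spatiotemporal features) followed by a 1D conv across
    clips (long-term, video-level temporal modeling). All names and the
    residual mixing are illustrative assumptions, not the F4D code."""

    def __init__(self, channels: int):
        super().__init__()
        # Short-term: standard 3x3x3 convolution applied per clip.
        self.clip_conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        # Long-term: 1D convolution along the clip axis.
        self.video_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, clips, channels, time, height, width)
        b, u, c, t, h, w = x.shape
        y = self.clip_conv(x.reshape(b * u, c, t, h, w))        # per-clip 3D conv
        y = y.reshape(b, u, c, t, h, w)
        # Pool spatiotemporal dims inside each clip, then mix across clips.
        z = y.mean(dim=(3, 4, 5))                                # (b, u, c)
        z = self.video_conv(z.transpose(1, 2)).transpose(1, 2)   # (b, u, c)
        # Broadcast the long-term context back as a residual.
        return y + z[..., None, None, None]


x = torch.randn(2, 4, 16, 8, 32, 32)   # 2 videos, 4 clips each
print(Factorized4DBlock(16)(x).shape)  # torch.Size([2, 4, 16, 8, 32, 32])
```

Factorizing this way keeps the cost close to a plain clip-based 3D CNN while still letting information flow across the entire video.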
Related papers
- Intelligent 3D Network Protocol for Multimedia Data Classification using
Deep Learning [0.0]
We implement a hybrid deep learning architecture that combines STIP and 3D CNN features to effectively enhance performance on 3D videos.
The results are compared with state-of-the-art frameworks from the literature for action recognition on UCF101, achieving an accuracy of 95%.
arXiv Detail & Related papers (2022-07-23T12:24:52Z)
- In Defense of Image Pre-Training for Spatiotemporal Recognition [32.56468478601864]
The key to effectively leveraging image pre-training lies in decomposing the learning of spatial and temporal features.
The new pipeline consistently achieves better results on video recognition with a significant speedup.
arXiv Detail & Related papers (2022-05-03T18:45:44Z)
- MoViNets: Mobile Video Networks for Efficient Video Recognition [52.49314494202433]
3D convolutional neural networks (CNNs) are accurate at video recognition but require large computation and memory budgets.
We propose a three-step approach to improve computational efficiency while substantially reducing the peak memory usage of 3D CNNs.
arXiv Detail & Related papers (2021-03-21T23:06:38Z)
- Learning Compositional Representation for 4D Captures with Neural ODE [72.56606274691033]
We introduce a compositional representation for 4D captures that disentangles shape, initial state, and motion.
To model the motion, a neural Ordinary Differential Equation (ODE) is trained to update the initial state conditioned on the learned motion code.
A decoder takes the shape code and the updated pose code to reconstruct 4D captures at each time stamp.
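As a toy illustration of this compositional scheme (module names, sizes, and the plain Euler integrator below are assumptions, not the paper's implementation), a learned ODE can advance an initial pose code under a motion code, after which a decoder combines it with the static shape code:

```python
import torch
import torch.nn as nn

class LatentODE(nn.Module):
    """Toy version of the compositional idea: a neural ODE advances an
    initial pose state conditioned on a motion code; a decoder combines
    the static shape code with the evolved pose code. Plain Euler
    integration stands in for a proper ODE solver; all names are
    illustrative placeholders."""

    def __init__(self, dim: int = 32):
        super().__init__()
        self.dynamics = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())
        self.decoder = nn.Linear(2 * dim, 3)  # e.g. a per-point offset

    def forward(self, shape, pose0, motion, t: float, steps: int = 20):
        dt, pose = t / steps, pose0
        for _ in range(steps):  # Euler: pose += f(pose, motion) * dt
            pose = pose + self.dynamics(torch.cat([pose, motion], -1)) * dt
        return self.decoder(torch.cat([shape, pose], -1))

model = LatentODE()
shape, pose0, motion = (torch.randn(1, 32) for _ in range(3))
print(model(shape, pose0, motion, t=0.5).shape)  # torch.Size([1, 3])
```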
arXiv Detail & Related papers (2021-03-15T10:55:55Z)
- TCLR: Temporal Contrastive Learning for Video Representation [49.6637562402604]
We develop a new temporal contrastive learning framework consisting of two novel losses to improve upon existing contrastive self-supervised video representation learning methods.
With the commonly used 3D-ResNet-18 architecture, we achieve 82.4% (+5.1% increase over the previous best) top-1 accuracy on UCF101 and 52.9% (+5.4% increase) on HMDB51 action classification.
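For intuition, a minimal InfoNCE-style loss in the spirit of TCLR's local-local objective might look as follows; the exact formulation, temperature, and feature shapes are assumptions rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F

def local_local_contrastive(za, zb, temperature: float = 0.1):
    """Toy temporal contrastive loss (an assumption, not the authors'
    exact objective): two augmented views of the same video yield
    per-timestep features; the matching timestep in the other view is
    the positive, every other timestep is a negative.
    za, zb: (timesteps, dim) features from two views of one video."""
    za, zb = F.normalize(za, dim=-1), F.normalize(zb, dim=-1)
    logits = za @ zb.t() / temperature    # (T, T) similarity matrix
    targets = torch.arange(za.size(0))    # positives on the diagonal
    return F.cross_entropy(logits, targets)

za, zb = torch.randn(8, 128), torch.randn(8, 128)
print(local_local_contrastive(za, zb))
```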
arXiv Detail & Related papers (2021-01-20T05:38:16Z)
- 3D CNNs with Adaptive Temporal Feature Resolutions [83.43776851586351]
The Similarity Guided Sampling (SGS) module can be plugged into any existing 3D CNN architecture.
SGS empowers 3D CNNs by learning the similarity of temporal features and grouping similar features together.
Our evaluations show that the proposed module improves the state-of-the-art by reducing the computational cost (GFLOPs) by half while preserving or even improving the accuracy.
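A minimal sketch of what similarity-guided temporal grouping can look like (a simplified guess at the mechanism for illustration; the published SGS module is differentiable and considerably more involved):

```python
import torch
import torch.nn.functional as F

def similarity_guided_pooling(feats: torch.Tensor, threshold: float = 0.9):
    """Toy stand-in for SGS-style grouping: adjacent temporal features
    whose cosine similarity exceeds `threshold` are merged by averaging,
    shrinking the temporal axis and hence downstream compute.
    feats: (time, channels) temporal feature vectors for one video."""
    groups = [[feats[0]]]
    for t in range(1, feats.size(0)):
        sim = F.cosine_similarity(feats[t], groups[-1][-1], dim=0)
        if sim > threshold:
            groups[-1].append(feats[t])   # similar: extend current group
        else:
            groups.append([feats[t]])     # dissimilar: start a new group
    return torch.stack([torch.stack(g).mean(0) for g in groups])

feats = torch.randn(16, 64)
print(similarity_guided_pooling(feats).shape)  # (<=16, 64)
```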
arXiv Detail & Related papers (2020-11-17T14:34:05Z)
- A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high processing speed.
Our method achieves clear improvements over state-of-the-art real-time methods on the UCF101 action recognition benchmark: 5.4% higher accuracy and 2x faster inference, with a model storage size of less than 5 MB.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)
- V4D: 4D Convolutional Neural Networks for Video-level Representation Learning [58.548331848942865]
Most 3D CNNs for video representation learning are clip-based, and thus do not consider the video-level temporal evolution of features.
We propose Video-level 4D Convolutional Neural Networks, or V4D, to model long-range representations with 4D convolutions.
V4D achieves excellent results, surpassing recent 3D CNNs by a large margin.
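To make the 4D-convolution idea concrete, the sketch below emulates a 4D convolution over (clips, time, height, width) by summing one 3D convolution per kernel position along the clip axis; this is a generic construction for illustration, not the V4D code:

```python
import torch
import torch.nn as nn

class Conv4d(nn.Module):
    """Minimal 4D convolution over (clips, time, H, W), built as one 3D
    convolution per position of the kernel along the clip axis, summed
    over clip offsets. An illustration of the idea, not the V4D code."""

    def __init__(self, cin, cout, k_clip=3, k=3):
        super().__init__()
        self.k_clip = k_clip
        self.convs = nn.ModuleList(
            nn.Conv3d(cin, cout, k, padding=k // 2) for _ in range(k_clip))

    def forward(self, x):
        # x: (batch, clips, channels, time, height, width)
        b, u, c, t, h, w = x.shape
        pad = self.k_clip // 2
        xp = nn.functional.pad(x, (0,) * 8 + (pad, pad))  # pad clip axis
        out = 0
        for i, conv in enumerate(self.convs):             # slide over clips
            xi = xp[:, i:i + u].reshape(b * u, c, t, h, w)
            out = out + conv(xi)
        return out.reshape(b, u, -1, t, h, w)

x = torch.randn(2, 4, 8, 6, 16, 16)
print(Conv4d(8, 8)(x).shape)  # torch.Size([2, 4, 8, 6, 16, 16])
```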
arXiv Detail & Related papers (2020-02-18T09:27:41Z)
- An Information-rich Sampling Technique over Spatio-Temporal CNN for Classification of Human Actions in Videos [5.414308305392762]
We propose a novel scheme for human action recognition in videos, using a 3-dimensional Convolutional Neural Network (3D CNN) based classifier.
In this paper, a 3D CNN architecture is proposed to extract features, followed by a Long Short-Term Memory (LSTM) network to recognize human actions.
Experiments performed on the KTH and WEIZMANN human action datasets show that it produces results comparable to state-of-the-art techniques.
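A generic 3D-CNN-plus-LSTM action classifier of the kind this summary describes might be sketched as follows (layer sizes, names, and the grayscale input are placeholders, not the paper's architecture):

```python
import torch
import torch.nn as nn

class C3DLSTM(nn.Module):
    """Generic 3D-CNN + LSTM action classifier: a small 3D CNN encodes
    each clip, an LSTM aggregates clip features over time, and the final
    hidden state is classified. All sizes are placeholders."""

    def __init__(self, n_classes: int = 6):  # e.g. 6 KTH action classes
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())  # clip -> 16-d vector
        self.lstm = nn.LSTM(16, 32, batch_first=True)
        self.head = nn.Linear(32, n_classes)

    def forward(self, clips):
        # clips: (batch, n_clips, 1, time, height, width), grayscale video
        b, n = clips.shape[:2]
        f = self.cnn(clips.flatten(0, 1)).reshape(b, n, -1)
        _, (h, _) = self.lstm(f)                    # last hidden state
        return self.head(h[-1])

clips = torch.randn(2, 5, 1, 8, 32, 32)
print(C3DLSTM()(clips).shape)  # torch.Size([2, 6])
```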
arXiv Detail & Related papers (2020-02-06T05:07:41Z)