TFCNet: Temporal Fully Connected Networks for Static Unbiased Temporal
Reasoning
- URL: http://arxiv.org/abs/2203.05928v1
- Date: Fri, 11 Mar 2022 13:58:05 GMT
- Title: TFCNet: Temporal Fully Connected Networks for Static Unbiased Temporal
Reasoning
- Authors: Shiwen Zhang
- Abstract summary: Current video classification benchmarks contain strong biases towards static features, and thus cannot accurately reflect temporal modeling ability.
New video classification benchmarks aiming to eliminate static biases have been proposed, and experiments on these new benchmarks show that current clip-based 3D CNNs are outperformed by RNN structures and recent video transformers.
With TFC blocks inserted into Video-level 3D CNNs (V3D), our proposed TFCNets establish new state-of-the-art results on the synthetic temporal reasoning benchmark CATER and the real-world static-unbiased dataset Diving48, surpassing all previous methods.
- Score: 3.4570413826505564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal reasoning is an important capability for visual intelligence. In
the computer vision research community, temporal reasoning is usually studied in
the form of video classification, for which many state-of-the-art neural network
structures and dataset benchmarks have been proposed in recent years, especially
3D CNNs and Kinetics. However, some recent works have found that current video
classification benchmarks contain strong biases towards static features, and thus
cannot accurately reflect temporal modeling ability. New video classification
benchmarks aiming to eliminate static biases have been proposed, and experiments
on these new benchmarks show that current clip-based 3D CNNs are outperformed by
RNN structures and recent video transformers.
In this paper, we find that 3D CNNs and their efficient depthwise variants, when a
video-level sampling strategy is used, are actually able to beat RNNs and recent
vision transformers by significant margins on static-unbiased temporal reasoning
benchmarks. Further, we propose the Temporal Fully Connected Block (TFC Block), an
efficient and effective component that approximates fully connected layers along
the temporal dimension to obtain a video-level receptive field, enhancing
spatiotemporal reasoning ability. With TFC blocks inserted into Video-level 3D CNNs
(V3D), our proposed TFCNets establish new state-of-the-art results on the synthetic
temporal reasoning benchmark CATER and the real-world static-unbiased dataset
Diving48, surpassing all previous methods.
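As a rough illustration of the TFC idea described above (a learned fully connected
mapping along the temporal dimension, so that every frame's features attain a
video-level receptive field), a minimal PyTorch sketch might look like the following.
The module name, tensor shapes, residual connection, and normalization are assumptions
made for illustration; the paper's actual block design and its placement inside V3D
are not reproduced here.

```python
import torch
import torch.nn as nn


class TFCBlock(nn.Module):
    """Illustrative sketch only: mix information across the full temporal extent
    of video-level features with a learned linear map over the time axis."""

    def __init__(self, channels: int, num_frames: int):
        super().__init__()
        # Learned fully connected mapping along the temporal dimension,
        # shared across all channels and spatial positions (an assumption).
        self.temporal_fc = nn.Linear(num_frames, num_frames, bias=False)
        self.norm = nn.BatchNorm3d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        y = x.permute(0, 1, 3, 4, 2)      # move time to the last dimension
        y = self.temporal_fc(y)           # every output frame sees all input frames
        y = y.permute(0, 1, 4, 2, 3)      # restore (B, C, T, H, W)
        return x + self.norm(y)           # residual connection (assumed)


if __name__ == "__main__":
    block = TFCBlock(channels=64, num_frames=16)
    feats = torch.randn(2, 64, 16, 14, 14)   # video-level features from a 3D CNN stage
    print(block(feats).shape)                # torch.Size([2, 64, 16, 14, 14])
```

Because the learned temporal map covers all input frames at once, a single such layer
already has a video-level temporal receptive field, whereas stacked small temporal
convolutions only grow their receptive field layer by layer.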
Related papers
- Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding [56.315932539150324]
We design a Unified Static and Dynamic Network (UniSDNet) to learn the semantic association between the video and text/audio queries.
Our UniSDNet is applicable to both Natural Language Video Grounding (NLVG) and Spoken Language Video Grounding (SLVG) tasks.
arXiv Detail & Related papers (2024-03-21T06:53:40Z) - F4D: Factorized 4D Convolutional Neural Network for Efficient
Video-level Representation Learning [4.123763595394021]
Most existing 3D convolutional neural network (CNN)-based methods for video-level representation learning are clip-based.
We propose a factorized 4D CNN architecture with attention (F4D) that is capable of learning more effective, finer-grained, long-term temporal video representations.
arXiv Detail & Related papers (2023-11-28T19:21:57Z) - Temporal Coherent Test-Time Optimization for Robust Video Classification [55.432935503341064]
Deep neural networks are likely to fail when the test data is corrupted in real-world deployment.
Test-time optimization is an effective way to adapt models for robustness to corrupted data during testing.
We propose a framework to utilize temporal information in test-time optimization for robust classification.
arXiv Detail & Related papers (2023-02-28T04:59:23Z) - Gate-Shift-Fuse for Video Action Recognition [43.8525418821458]
Gate-Shift-Fuse (GSF) is a novel spatio-temporal feature extraction module which controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data-dependent manner.
GSF can be inserted into existing 2D CNNs to convert them into efficient and high-performing spatio-temporal feature extractors, with negligible parameter and compute overhead.
We perform an extensive analysis of GSF using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks.
arXiv Detail & Related papers (2022-03-16T19:19:04Z) - Continual 3D Convolutional Neural Networks for Real-time Processing of
Videos [93.73198973454944]
We introduce Continual 3D Convolutional Neural Networks (Co3D CNNs).
Co3D CNNs process videos frame by frame rather than clip by clip.
We show that Co3D CNNs initialised with the weights of pre-existing state-of-the-art video recognition models reduce floating point operations for frame-wise computations by 10.0-12.4x while improving accuracy on Kinetics-400 by 2.3-3.8%.
arXiv Detail & Related papers (2021-05-31T18:30:52Z) - 3D CNNs with Adaptive Temporal Feature Resolutions [83.43776851586351]
The Similarity Guided Sampling (SGS) module can be plugged into any existing 3D CNN architecture.
SGS empowers 3D CNNs by learning the similarity of temporal features and grouping similar features together.
Our evaluations show that the proposed module improves the state-of-the-art by reducing the computational cost (GFLOPs) by half while preserving or even improving the accuracy.
arXiv Detail & Related papers (2020-11-17T14:34:05Z) - Dissected 3D CNNs: Temporal Skip Connections for Efficient Online Video
Processing [15.980090046426193]
Convolutional Neural Networks with 3D kernels (3D-CNNs) currently achieve state-of-the-art results in video recognition tasks.
We propose dissected 3D-CNNs, where the intermediate volumes of the network are dissected and propagated over depth (time) dimension for future calculations.
For action classification, the dissected versions of ResNet models perform 77-90% fewer computations during online operation.
arXiv Detail & Related papers (2020-09-30T12:48:52Z) - A Real-time Action Representation with Temporal Encoding and Deep
Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while achieving a high processing speed.
Our method achieves clear improvements over state-of-the-art real-time methods on the UCF101 action recognition benchmark, with 5.4% higher accuracy and 2 times faster inference, using a model that requires less than 5 MB of storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z) - V4D: 4D Convolutional Neural Networks for Video-level Representation
Learning [58.548331848942865]
Most 3D CNNs for video representation learning are clip-based, and thus do not consider the video-level temporal evolution of features.
We propose Video-level 4D Convolutional Neural Networks, or V4D, to model long-range representation with 4D convolutions.
V4D achieves excellent results, surpassing recent 3D CNNs by a large margin.
arXiv Detail & Related papers (2020-02-18T09:27:41Z)
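Several of the entries above (F4D, Co3D CNNs, V4D), like the main abstract, hinge on
the distinction between clip-based processing and video-level sampling. A minimal
sketch of that distinction follows; the function names and shapes are illustrative
assumptions, not taken from any of the papers.

```python
import torch


def clip_based_sample(video: torch.Tensor, clip_len: int, start: int) -> torch.Tensor:
    """Clip-based sampling: take consecutive frames from one location, so the
    model only ever sees a short temporal window of the video."""
    return video[start:start + clip_len]


def video_level_sample(video: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Video-level sampling: pick frames spread uniformly over the whole video,
    so one forward pass covers its full temporal extent."""
    idx = torch.linspace(0, video.shape[0] - 1, num_frames).long()
    return video[idx]


if __name__ == "__main__":
    video = torch.randn(300, 3, 224, 224)                # e.g. 300 decoded frames
    print(clip_based_sample(video, 16, start=0).shape)   # (16, 3, 224, 224), one short window
    print(video_level_sample(video, 16).shape)           # (16, 3, 224, 224), spanning the video
```

Video-level sampling is the setting in which the abstract reports 3D CNNs overtaking
RNNs and video transformers on static-unbiased benchmarks.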
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.