CT-Net: Channel Tensorization Network for Video Classification
- URL: http://arxiv.org/abs/2106.01603v1
- Date: Thu, 3 Jun 2021 05:35:43 GMT
- Title: CT-Net: Channel Tensorization Network for Video Classification
- Authors: Kunchang Li, Xianhang Li, Yali Wang, Jun Wang and Yu Qiao
- Abstract summary: 3D convolution is powerful for video classification but often computationally expensive.
Most approaches fail to achieve a preferable balance between convolutional efficiency and feature-interaction sufficiency.
We propose a concise and novel Channel Tensorization Network (CT-Net).
Our CT-Net outperforms a number of recent SOTA approaches in terms of accuracy and/or efficiency.
- Score: 48.4482794950675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D convolution is powerful for video classification but often computationally
expensive; recent studies mainly focus on decomposing it along the spatial-temporal
and/or channel dimensions. Unfortunately, most approaches fail to achieve a
preferable balance between convolutional efficiency and feature-interaction
sufficiency. For this reason, we propose a concise and novel Channel
Tensorization Network (CT-Net), by treating the channel dimension of input
feature as a multiplication of K sub-dimensions. On one hand, it naturally
factorizes the convolution across multiple dimensions, leading to a light
computation burden. On the other hand, it can effectively enhance feature
interaction among different channels, and progressively enlarge the 3D receptive
field of such interaction to boost classification accuracy. Furthermore, we
equip our CT-Module with a Tensor Excitation (TE) mechanism. It can learn to
exploit spatial, temporal and channel attention in a high-dimensional manner,
to improve the cooperative power of all the feature dimensions in our
CT-Module. Finally, we flexibly adapt ResNet as our CT-Net. Extensive
experiments are conducted on several challenging video benchmarks, e.g.,
Kinetics-400, Something-Something V1 and V2. Our CT-Net outperforms a number of
recent SOTA approaches in terms of accuracy and/or efficiency. The code and
models will be available at https://github.com/Andy1621/CT-Net.
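To make the channel-tensorization idea concrete, here is a minimal PyTorch sketch (an illustration under assumed shapes, not the authors' released code). It treats C = c1 x c2 channels as two sub-dimensions (the K = 2 case) and lets two grouped convolutions interact each sub-dimension in turn:

```python
import torch
import torch.nn as nn

class TensorizedChannelInteraction(nn.Module):
    """Sketch: factorize C = c1 * c2 channels into two sub-dimensions and
    interact each sub-dimension with a cheap grouped 3D convolution."""
    def __init__(self, c1: int, c2: int):
        super().__init__()
        channels = c1 * c2
        self.c1, self.c2 = c1, c2
        # groups=c1: channels mix only within each block of c2 channels
        self.conv_sub2 = nn.Conv3d(channels, channels, 3, padding=1, groups=c1)
        # after swapping sub-dimensions, groups=c2 mixes along c1
        self.conv_sub1 = nn.Conv3d(channels, channels, 3, padding=1, groups=c2)

    def _swap_subdims(self, x):
        n, _, t, h, w = x.shape
        # view channels as (c1, c2), swap the two sub-dimensions, flatten back
        return (x.view(n, self.c1, self.c2, t, h, w)
                 .transpose(1, 2).contiguous()
                 .view(n, self.c1 * self.c2, t, h, w))

    def forward(self, x):                    # x: (N, c1*c2, T, H, W)
        x = self.conv_sub2(x)                # interaction along the c2 sub-dimension
        x = self._swap_subdims(x)
        x = self.conv_sub1(x)                # interaction along the c1 sub-dimension
        return self._swap_subdims(x)         # restore the original channel layout

y = TensorizedChannelInteraction(8, 8)(torch.randn(2, 64, 4, 14, 14))
print(y.shape)                               # torch.Size([2, 64, 4, 14, 14])
```

Each grouped convolution mixes channels within only one sub-dimension, so the channel-interaction cost drops roughly from C * C to C * (c1 + c2), while the permutation between the two convolutions lets information eventually reach every channel.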
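The Tensor Excitation (TE) mechanism is described only at a high level in the abstract; the following squeeze-and-excitation-style sketch is one plausible reading (an assumption, not the paper's exact design) of attention applied over the channel and then the temporal-spatial dimensions:

```python
import torch
import torch.nn as nn

class TensorExcitation(nn.Module):
    """Hypothetical SE-style reading of excitation over all dimensions:
    a channel gate followed by a temporal-spatial gate, both multiplied
    back onto the feature map."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),                       # squeeze T, H, W
            nn.Conv3d(channels, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv3d(hidden, channels, 1), nn.Sigmoid(),
        )
        self.st_gate = nn.Sequential(                      # gate over (T, H, W)
            nn.Conv3d(1, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):                                  # x: (N, C, T, H, W)
        x = x * self.channel_gate(x)                       # channel attention
        pooled = x.mean(dim=1, keepdim=True)               # squeeze channels
        return x * self.st_gate(pooled)                    # temporal-spatial attention

y = TensorExcitation(64)(torch.randn(2, 64, 4, 14, 14))
print(y.shape)                                             # torch.Size([2, 64, 4, 14, 14])
```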
Related papers
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
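As an illustration of the "TC" stream's key step, the snippet below (a generic PyWavelets example, not TCCT-Net's code) shows how the CWT maps a 1D behavioral signal to the 2D time-frequency tensor a convolutional stream can consume:

```python
import numpy as np
import pywt

# Hypothetical illustration: a 1D signal becomes a 2D time-frequency
# tensor via the Continuous Wavelet Transform.
fs = 128                                        # assumed sampling rate (Hz)
t = np.arange(0, 4, 1 / fs)                     # 4 s of signal
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

scales = np.arange(1, 65)                       # 64 scales -> 64 frequency rows
coeffs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1 / fs)
print(coeffs.shape)                             # (64, 512): ready for a 2D CNN
```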
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
- Spatial-Spectral Hyperspectral Classification based on Learnable 3D Group Convolution [18.644268589334217]
This paper proposes a learnable group convolution network (LGCNet) based on an improved 3D-DenseNet model and a lightweight model design.
The LGCNet module addresses the shortcomings of group convolution by introducing dynamic learning of input-channel and convolution-kernel grouping.
LGCNet has achieved progress in inference speed and accuracy, and outperforms mainstream hyperspectral image classification methods on the Indian Pines, Pavia University, and KSC datasets.
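A loose sketch of the dynamic-grouping idea (hypothetical, not LGCNet's actual mechanism): a tiny gating branch makes the grouped convolution's effective channel mix input-dependent:

```python
import torch
import torch.nn as nn

class DynamicGroupConv3d(nn.Module):
    """Hypothetical sketch: a gating branch re-weights input channels per
    sample before a grouped 3D convolution, making the effective grouping
    input-dependent."""
    def __init__(self, channels: int, groups: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),              # squeeze T, H, W
            nn.Conv3d(channels, channels, 1),     # per-channel logits
            nn.Sigmoid(),
        )
        self.conv = nn.Conv3d(channels, channels, 3, padding=1, groups=groups)

    def forward(self, x):                         # x: (N, C, T, H, W)
        return self.conv(x * self.gate(x))        # gate, then grouped conv

y = DynamicGroupConv3d(32, groups=4)(torch.randn(2, 32, 4, 9, 9))
print(y.shape)                                    # torch.Size([2, 32, 4, 9, 9])
```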
arXiv Detail & Related papers (2023-07-15T05:47:12Z)
- DGCNet: An Efficient 3D-Densenet based on Dynamic Group Convolution for Hyperspectral Remote Sensing Image Classification [22.025733502296035]
We introduce a lightweight model, DGCNet, based on an improved 3D-DenseNet.
Multiple groups can capture different and complementary visual and semantic features of input images, allowing convolutional neural networks (CNNs) to learn rich features.
Inference speed and accuracy are improved, with outstanding performance on the IN, Pavia and KSC datasets.
arXiv Detail & Related papers (2023-07-13T10:19:48Z)
- An Efficient Speech Separation Network Based on Recurrent Fusion Dilated Convolution and Channel Attention [0.2538209532048866]
We present an efficient speech separation neural network, ARFDCN, which combines dilated convolutions, multi-scale fusion (MSF), and channel attention.
Experimental results indicate that the model achieves a decent balance between performance and computational efficiency.
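For intuition, here is a minimal sketch of the combination this summary names (assumed dilation rates and layout, not ARFDCN's actual architecture): parallel dilated 1D convolutions fused and re-weighted by channel attention:

```python
import torch
import torch.nn as nn

class DilatedFusionBlock(nn.Module):
    """Illustrative sketch (not ARFDCN itself): parallel dilated 1D convs
    capture multi-scale context; SE-style channel attention re-weights
    the fused result."""
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 2, 4)                 # assumed dilation rates
        ])
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (N, C, time)
        fused = sum(branch(x) for branch in self.branches)
        return fused * self.attn(fused)        # channel attention

y = DilatedFusionBlock(16)(torch.randn(2, 16, 100))
print(y.shape)                                 # torch.Size([2, 16, 100])
```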
arXiv Detail & Related papers (2023-06-09T13:30:27Z)
- Gate-Shift-Fuse for Video Action Recognition [43.8525418821458]
Gate-Shift-Fuse (GSF) is a novel spatio-temporal feature extraction module which controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data-dependent manner.
GSF can be inserted into existing 2D CNNs to convert them into efficient, high-performing spatio-temporal feature extractors, with negligible parameter and compute overhead.
We perform an extensive analysis of GSF using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks.
arXiv Detail & Related papers (2022-03-16T19:19:04Z)
- STSM: Spatio-Temporal Shift Module for Efficient Action Recognition [4.096670184726871]
We propose a plug-and-play Spatio-temporal Shift Module (STSM) that is both effective and high-performance.
In particular, when inserted into 2D CNNs, STSM allows the network to learn efficient spatio-temporal features.
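A plain temporal shift in the spirit of such modules (a generic sketch, not the exact STSM) looks like this: a fraction of channels is shifted one step backward in time, another fraction one step forward, and the rest left untouched:

```python
import torch

def temporal_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    # x: (N, T, C, H, W); shift 1/fold_div of the channels one step back
    # in time, another 1/fold_div one step forward, keep the rest.
    n, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # backward shift
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # forward shift
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # no shift
    return out

y = temporal_shift(torch.randn(2, 8, 64, 14, 14))
print(y.shape)                                             # torch.Size([2, 8, 64, 14, 14])
```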
arXiv Detail & Related papers (2021-12-05T09:40:49Z)
- Improving 3D Object Detection with Channel-wise Transformer [58.668922561622466]
We propose a two-stage 3D object detection framework (CT3D) with minimal hand-crafted design.
CT3D simultaneously performs proposal-aware embedding and channel-wise context aggregation.
It achieves an AP of 81.77% in the moderate car category on the KITTI test 3D detection benchmark.
arXiv Detail & Related papers (2021-08-23T02:03:40Z)
- Global Filter Networks for Image Classification [90.81352483076323]
We present the Global Filter Network (GFNet), a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
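The core operation is easy to sketch (assumed shapes; close in spirit to, but not copied from, the GFNet code): transform to the Fourier domain, multiply by a learnable filter, and transform back:

```python
import torch
import torch.nn as nn

class GlobalFilter2d(nn.Module):
    """Sketch of the global-filter idea: mix all spatial locations via an
    element-wise product with a learnable filter in the 2D Fourier domain,
    at O(N log N) cost."""
    def __init__(self, channels: int, h: int, w: int):
        super().__init__()
        # learnable complex filter stored as (real, imag) pairs
        self.filter = nn.Parameter(torch.randn(channels, h, w // 2 + 1, 2) * 0.02)

    def forward(self, x):                          # x: (N, C, H, W)
        freq = torch.fft.rfft2(x, norm="ortho")    # to frequency domain
        freq = freq * torch.view_as_complex(self.filter)
        return torch.fft.irfft2(freq, s=x.shape[-2:], norm="ortho")

y = GlobalFilter2d(16, 32, 32)(torch.randn(2, 16, 32, 32))
print(y.shape)                                     # torch.Size([2, 16, 32, 32])
```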
arXiv Detail & Related papers (2021-07-01T17:58:16Z)
- MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
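One plausible reading of a separable multi-view convolution (a hedged sketch, not MVFNet's exact MVF module): depthwise 1D convolutions along the temporal, height, and width axes of the same tensor, summed residually:

```python
import torch
import torch.nn as nn

class MultiViewConv(nn.Module):
    """Hypothetical sketch: three separable depthwise 1D convolutions view
    the same tensor along T, H, and W; their responses are fused by a
    residual sum."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv_t = nn.Conv3d(channels, channels, (3, 1, 1),
                                padding=(1, 0, 0), groups=channels)
        self.conv_h = nn.Conv3d(channels, channels, (1, 3, 1),
                                padding=(0, 1, 0), groups=channels)
        self.conv_w = nn.Conv3d(channels, channels, (1, 1, 3),
                                padding=(0, 0, 1), groups=channels)

    def forward(self, x):                  # x: (N, C, T, H, W)
        return x + self.conv_t(x) + self.conv_h(x) + self.conv_w(x)

y = MultiViewConv(16)(torch.randn(2, 16, 4, 9, 9))
print(y.shape)                             # torch.Size([2, 16, 4, 9, 9])
```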
arXiv Detail & Related papers (2020-12-13T06:34:18Z)
- A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method achieves clear improvements over state-of-the-art real-time methods on the UCF101 action recognition benchmark: 5.4% higher accuracy and 2x faster inference, with a model of less than 5 MB.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)