VidConv: A modernized 2D ConvNet for Efficient Video Recognition
- URL: http://arxiv.org/abs/2207.03782v1
- Date: Fri, 8 Jul 2022 09:33:46 GMT
- Title: VidConv: A modernized 2D ConvNet for Efficient Video Recognition
- Authors: Chuong H. Nguyen, Su Huynh, Vinh Nguyen, Ngoc Nguyen
- Abstract summary: Vision Transformers (ViTs) have been steadily breaking records on many vision tasks.
ViTs are generally computationally expensive, memory-hungry, and unfriendly to embedded devices.
In this paper, we adopt the modernized structure of ConvNet to design a new backbone for action recognition.
- Score: 0.8070014188337304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since being introduced in 2020, Vision Transformers (ViTs) have been steadily
breaking records on many vision tasks and are often described as the
``all-you-need'' replacement for ConvNets. Despite that, ViTs are generally
computationally expensive, memory-hungry, and unfriendly to embedded devices. In
addition, recent research shows that a standard ConvNet, if redesigned and trained
appropriately, can compete favorably with ViTs in terms of accuracy and
scalability. In this paper, we adopt the modernized ConvNet structure to
design a new backbone for action recognition. In particular, our main target is
industrial product deployment, such as FPGA boards on which only
standard operations are supported. Therefore, our network consists solely of 2D
convolutions, without any 3D convolutions, long-range attention plugins, or
Transformer blocks. While being trained with far fewer epochs (5x-10x fewer), our
backbone surpasses methods using (2+1)D and 3D convolutions, and achieves
results comparable to ViTs on two benchmark datasets.
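The abstract does not detail the block design, but a minimal sketch of the general recipe it describes (a backbone built only from standard 2D convolutions, with frames folded into the batch dimension and no 3D convolution, attention plugin, or Transformer block) might look as follows; the stem, block layout, and temporal average pooling are illustrative assumptions, not the published VidConv architecture.
```python
# Hypothetical sketch (not the authors' VidConv implementation): a video
# backbone built only from standard 2D convolutions. Frames are folded into
# the batch dimension, processed by ConvNeXt-style 2D blocks, and temporal
# information is aggregated by simple average pooling at the end.
import torch
import torch.nn as nn


class Conv2DBlock(nn.Module):
    """ConvNeXt-style block: depthwise 7x7 conv + pointwise MLP (all 2D ops)."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.BatchNorm2d(dim)            # BN folds into conv, FPGA-friendly
        self.pwconv1 = nn.Conv2d(dim, 4 * dim, kernel_size=1)
        self.act = nn.ReLU(inplace=True)
        self.pwconv2 = nn.Conv2d(4 * dim, dim, kernel_size=1)

    def forward(self, x):
        return x + self.pwconv2(self.act(self.pwconv1(self.norm(self.dwconv(x)))))


class Video2DBackbone(nn.Module):
    """Applies a shared 2D ConvNet to every frame, then pools over time."""
    def __init__(self, in_ch=3, dim=96, depth=4, num_classes=400):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, dim, kernel_size=4, stride=4)   # patchify stem
        self.blocks = nn.Sequential(*[Conv2DBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                       # video: (B, T, C, H, W)
        b, t, c, h, w = video.shape
        x = video.reshape(b * t, c, h, w)           # fold time into batch -> pure 2D
        x = self.blocks(self.stem(x))
        x = x.mean(dim=(2, 3)).reshape(b, t, -1)    # spatial GAP per frame
        return self.head(x.mean(dim=1))             # average over frames


if __name__ == "__main__":
    clip = torch.randn(2, 8, 3, 224, 224)           # 8-frame clip
    print(Video2DBackbone()(clip).shape)            # torch.Size([2, 400])
```
Folding time into the batch dimension keeps every operator a plain 2D convolution, which is the property the abstract highlights for FPGA deployment, where only standard operations are supported.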
Related papers
- 3D-RCNet: Learning from Transformer to Build a 3D Relational ConvNet for Hyperspectral Image Classification [8.124761584272132]
We propose a 3D relational ConvNet named 3D-RCNet, which inherits both strengths of ConvNet and ViT.
The proposed 3D-RCNet maintains the high computational efficiency of ConvNet while enjoying the flexibility of ViT.
Empirical evaluations on three representative benchmark HSI datasets show that the proposed model outperforms previous ConvNet-based and ViT-based HSI approaches.
arXiv Detail & Related papers (2024-08-25T05:41:47Z)
- Are Large Kernels Better Teachers than Transformers for ConvNets? [82.4742785108714]
This paper reveals a new appeal of the recently emerged large-kernel Convolutional Neural Networks (ConvNets): serving as the teacher in Knowledge Distillation (KD) for small-kernel ConvNets.
arXiv Detail & Related papers (2023-05-30T21:05:23Z)
- Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition [158.15602882426379]
This paper does not attempt to design a state-of-the-art method for visual recognition but investigates a more efficient way to make use of convolutions to encode spatial features.
By comparing the design principles of recent convolutional neural networks (ConvNets) and Vision Transformers, we propose to simplify self-attention by leveraging a convolutional modulation operation.
arXiv Detail & Related papers (2022-11-22T01:39:45Z)
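As a rough, hedged sketch of the convolutional-modulation idea named above (not necessarily Conv2Former's exact configuration): the attention map is replaced by the output of a large-kernel depthwise 2D convolution, which gates a linearly projected value tensor element-wise; the kernel size and projection layout below are illustrative choices.
```python
# Hedged sketch of a convolutional-modulation block in the spirit of
# Conv2Former: a large-kernel depthwise convolution provides spatial context
# that modulates a projected "value" tensor via a Hadamard product.
import torch
import torch.nn as nn


class ConvModulation(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 11):
        super().__init__()
        self.proj_a = nn.Conv2d(dim, dim, 1)        # context branch
        self.proj_v = nn.Conv2d(dim, dim, 1)        # value branch
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.proj_out = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                            # x: (B, C, H, W)
        a = self.dwconv(self.proj_a(x))              # large-kernel spatial context
        v = self.proj_v(x)
        return self.proj_out(a * v)                  # element-wise modulation


if __name__ == "__main__":
    y = ConvModulation(64)(torch.randn(1, 64, 56, 56))
    print(y.shape)                                   # torch.Size([1, 64, 56, 56])
```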
- Fast-ParC: Capturing Position Aware Global Feature for ConvNets and ViTs [35.39701561076837]
We propose a new basic neural network operator named position-aware circular convolution (ParC) and its accelerated version Fast-ParC.
Our Fast-ParC further reduces the O(n^2) time complexity of ParC to O(n log n) using the Fast Fourier Transform (FFT).
Experiment results show that our ParC op can effectively enlarge the receptive field of traditional ConvNets.
arXiv Detail & Related papers (2022-10-08T13:14:02Z)
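The O(n^2) to O(n log n) reduction quoted above is the standard convolution-theorem argument; a minimal 1D illustration (not the paper's ParC operator, which acts on 2D feature maps) is sketched below.
```python
# Hedged illustration of the complexity claim, not the paper's code: a global
# circular convolution computed directly costs O(n^2), while the same result
# obtained via the FFT convolution theorem costs O(n log n).
import torch


def circular_conv_direct(x: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """O(n^2): y[i] = sum_j x[j] * k[(i - j) mod n]."""
    n = x.numel()
    idx = (torch.arange(n).unsqueeze(1) - torch.arange(n)) % n   # (i - j) mod n
    return (x.unsqueeze(0) * k[idx]).sum(dim=1)


def circular_conv_fft(x: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """O(n log n): pointwise product in the Fourier domain."""
    return torch.fft.ifft(torch.fft.fft(x) * torch.fft.fft(k)).real


if __name__ == "__main__":
    x, k = torch.randn(1024), torch.randn(1024)
    assert torch.allclose(circular_conv_direct(x, k),
                          circular_conv_fft(x, k), atol=1e-3)
```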
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers [1.14219428942199]
We study CMTL from 2D to 3D sensor to explore the upper bound performance of 3D sensor only systems.
While most CMTL pipelines from 2D to 3D vision are complicated and based on Convolutional Neural Networks (ConvNets), ours is easy to implement and expand, and is based on both ConvNets and Vision Transformers (ViTs).
arXiv Detail & Related papers (2022-03-20T05:03:29Z)
- A ConvNet for the 2020s [94.89735578018099]
Vision Transformers (ViTs) quickly superseded ConvNets as the state-of-the-art image classification model.
It is the hierarchical Transformers that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone.
In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve.
arXiv Detail & Related papers (2022-01-10T18:59:10Z)
- ConvNets vs. Transformers: Whose Visual Representations are More Transferable? [49.62201738334348]
We investigate the transfer learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations.
We observe consistent advantages of Transformer-based backbones on 13 downstream tasks.
arXiv Detail & Related papers (2021-08-11T16:20:38Z)
- Rethinking the Design Principles of Robust Vision Transformer [28.538786330184642]
Vision Transformers (ViTs) have shown that self-attention-based networks surpass traditional convolutional neural networks (CNNs) in most vision tasks.
In this paper, we rethink the design principles of ViTs based on robustness.
By combining the robust design components, we propose the Robust Vision Transformer (RVT).
arXiv Detail & Related papers (2021-05-17T15:04:15Z)
- Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [103.03973037619532]
This work investigates a simple backbone network useful for many dense prediction tasks without convolutions.
Unlike the recently proposed Transformer model (e.g., ViT) that is specially designed for image classification, we propose the Pyramid Vision Transformer (PVT).
PVT can be trained on dense partitions of the image to achieve high output resolution, which is important for dense prediction.
arXiv Detail & Related papers (2021-02-24T08:33:55Z)