VidConv: A modernized 2D ConvNet for Efficient Video Recognition
- URL: http://arxiv.org/abs/2207.03782v1
- Date: Fri, 8 Jul 2022 09:33:46 GMT
- Title: VidConv: A modernized 2D ConvNet for Efficient Video Recognition
- Authors: Chuong H. Nguyen, Su Huynh, Vinh Nguyen, Ngoc Nguyen
- Abstract summary: Vision Transformers (ViTs) have been steadily breaking records on many vision tasks.
ViTs are generally computationally expensive, memory-hungry, and unfriendly to embedded devices.
In this paper, we adopt the modernized structure of ConvNet to design a new backbone for action recognition.
- Score: 0.8070014188337304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since being introduced in 2020, Vision Transformers (ViTs) have been steadily
breaking records on many vision tasks and are often described as the
``all-you-need'' replacement for ConvNets. Despite that, ViTs are generally
computationally expensive, memory-hungry, and unfriendly to embedded devices. In
addition, recent research shows that a standard ConvNet, if redesigned and trained
appropriately, can compete favorably with ViTs in terms of accuracy and
scalability. In this paper, we adopt the modernized ConvNet structure to
design a new backbone for action recognition. In particular, our main target is
industrial product deployment, such as FPGA boards on which only
standard operations are supported. Therefore, our network consists solely of 2D
convolutions, without any 3D convolutions, long-range attention plugins, or
Transformer blocks. While being trained with far fewer epochs (5x-10x fewer), our
backbone surpasses methods using (2+1)D and 3D convolutions, and achieves
results comparable to ViTs on two benchmark datasets.
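The abstract does not detail the block design, but a minimal sketch of the general recipe it describes (a backbone built only from standard 2D convolutions, with frames folded into the batch dimension and no 3D convolution, attention plugin, or Transformer block) might look as follows; the stem, block layout, and temporal average pooling are illustrative assumptions, not the published VidConv architecture.
```python
# Hypothetical sketch (not the authors' VidConv implementation): a video
# backbone built only from standard 2D convolutions. Frames are folded into
# the batch dimension, processed by ConvNeXt-style 2D blocks, and temporal
# information is aggregated by simple average pooling at the end.
import torch
import torch.nn as nn


class Conv2DBlock(nn.Module):
    """ConvNeXt-style block: depthwise 7x7 conv + pointwise MLP (all 2D ops)."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.BatchNorm2d(dim)            # BN folds into conv, FPGA-friendly
        self.pwconv1 = nn.Conv2d(dim, 4 * dim, kernel_size=1)
        self.act = nn.ReLU(inplace=True)
        self.pwconv2 = nn.Conv2d(4 * dim, dim, kernel_size=1)

    def forward(self, x):
        return x + self.pwconv2(self.act(self.pwconv1(self.norm(self.dwconv(x)))))


class Video2DBackbone(nn.Module):
    """Applies a shared 2D ConvNet to every frame, then pools over time."""
    def __init__(self, in_ch=3, dim=96, depth=4, num_classes=400):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, dim, kernel_size=4, stride=4)   # patchify stem
        self.blocks = nn.Sequential(*[Conv2DBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                       # video: (B, T, C, H, W)
        b, t, c, h, w = video.shape
        x = video.reshape(b * t, c, h, w)           # fold time into batch -> pure 2D
        x = self.blocks(self.stem(x))
        x = x.mean(dim=(2, 3)).reshape(b, t, -1)    # spatial GAP per frame
        return self.head(x.mean(dim=1))             # average over frames


if __name__ == "__main__":
    clip = torch.randn(2, 8, 3, 224, 224)           # 8-frame clip
    print(Video2DBackbone()(clip).shape)            # torch.Size([2, 400])
```
Folding time into the batch dimension keeps every operator a plain 2D convolution, which is the property the abstract highlights for FPGA deployment, where only standard operations are supported.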
Related papers
- 3D-RCNet: Learning from Transformer to Build a 3D Relational ConvNet for Hyperspectral Image Classification [8.124761584272132]
We propose a 3D relational ConvNet named 3D-RCNet, which inherits both strengths of ConvNet and ViT.
The proposed 3D-RCNet maintains the high computational efficiency of ConvNet while enjoying the flexibility of ViT.
Empirical evaluations on three representative benchmark HSI datasets show that the proposed model outperforms previous ConvNet-based and ViT-based HSI approaches.
arXiv Detail & Related papers (2024-08-25T05:41:47Z)
- Are Large Kernels Better Teachers than Transformers for ConvNets? [82.4742785108714]
This paper reveals a new appeal of the recently emerged large-kernel Convolutional Neural Networks (ConvNets): serving as the teacher in Knowledge Distillation (KD) for small-kernel ConvNets.
arXiv Detail & Related papers (2023-05-30T21:05:23Z)
- Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition [158.15602882426379]
This paper does not attempt to design a state-of-the-art method for visual recognition but investigates a more efficient way to make use of convolutions to encode spatial features.
By comparing the design principles of recent convolutional neural networks (ConvNets) and Vision Transformers, we propose to simplify self-attention by leveraging a convolutional modulation operation.
arXiv Detail & Related papers (2022-11-22T01:39:45Z)
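As a rough, hedged sketch of the convolutional-modulation idea named above (not necessarily Conv2Former's exact configuration): the attention map is replaced by the output of a large-kernel depthwise 2D convolution, which gates a linearly projected value tensor element-wise; the kernel size and projection layout below are illustrative choices.
```python
# Hedged sketch of a convolutional-modulation block in the spirit of
# Conv2Former: a large-kernel depthwise convolution provides spatial context
# that modulates a projected "value" tensor via a Hadamard product.
import torch
import torch.nn as nn


class ConvModulation(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 11):
        super().__init__()
        self.proj_a = nn.Conv2d(dim, dim, 1)        # context branch
        self.proj_v = nn.Conv2d(dim, dim, 1)        # value branch
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.proj_out = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                            # x: (B, C, H, W)
        a = self.dwconv(self.proj_a(x))              # large-kernel spatial context
        v = self.proj_v(x)
        return self.proj_out(a * v)                  # element-wise modulation


if __name__ == "__main__":
    y = ConvModulation(64)(torch.randn(1, 64, 56, 56))
    print(y.shape)                                   # torch.Size([1, 64, 56, 56])
```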
- Fast-ParC: Capturing Position Aware Global Feature for ConvNets and ViTs [35.39701561076837]
We propose a new basic neural network operator named position-aware circular convolution (ParC) and its accelerated version Fast-ParC.
Our Fast-ParC further reduces the O(n^2) time complexity of ParC to O(n log n) using the Fast Fourier Transform (FFT).
Experiment results show that our ParC op can effectively enlarge the receptive field of traditional ConvNets.
arXiv Detail & Related papers (2022-10-08T13:14:02Z)
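The O(n^2) to O(n log n) reduction quoted above is the standard convolution-theorem argument; a minimal 1D illustration (not the paper's ParC operator, which acts on 2D feature maps) is sketched below.
```python
# Hedged illustration of the complexity claim, not the paper's code: a global
# circular convolution computed directly costs O(n^2), while the same result
# obtained via the FFT convolution theorem costs O(n log n).
import torch


def circular_conv_direct(x: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """O(n^2): y[i] = sum_j x[j] * k[(i - j) mod n]."""
    n = x.numel()
    idx = (torch.arange(n).unsqueeze(1) - torch.arange(n)) % n   # (i - j) mod n
    return (x.unsqueeze(0) * k[idx]).sum(dim=1)


def circular_conv_fft(x: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """O(n log n): pointwise product in the Fourier domain."""
    return torch.fft.ifft(torch.fft.fft(x) * torch.fft.fft(k)).real


if __name__ == "__main__":
    x, k = torch.randn(1024), torch.randn(1024)
    assert torch.allclose(circular_conv_direct(x, k),
                          circular_conv_fft(x, k), atol=1e-3)
```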
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers [1.14219428942199]
We study CMTL from 2D to 3D sensor to explore the upper bound performance of 3D sensor only systems.
While most CMTL pipelines from 2D to 3D vision are complicated and based on Convolutional Neural Networks (ConvNets), ours is easy to implement and expand, and is based on both ConvNets and Vision Transformers (ViTs).
arXiv Detail & Related papers (2022-03-20T05:03:29Z)
- A ConvNet for the 2020s [94.89735578018099]
Vision Transformers (ViTs) quickly superseded ConvNets as the state-of-the-art image classification model.
It is the hierarchical Transformers that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone.
In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve.
arXiv Detail & Related papers (2022-01-10T18:59:10Z)
- ConvNets vs. Transformers: Whose Visual Representations are More Transferable? [49.62201738334348]
We investigate the transfer learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations.
We observe consistent advantages of Transformer-based backbones on 13 downstream tasks.
arXiv Detail & Related papers (2021-08-11T16:20:38Z)
- Rethinking the Design Principles of Robust Vision Transformer [28.538786330184642]
Vision Transformers (ViTs) have shown that self-attention-based networks surpass traditional convolutional neural networks (CNNs) in most vision tasks.
In this paper, we rethink the design principles of ViTs based on robustness.
By combining the robust design components, we propose the Robust Vision Transformer (RVT).
arXiv Detail & Related papers (2021-05-17T15:04:15Z)
- Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [103.03973037619532]
This work investigates a simple backbone network useful for many dense prediction tasks without convolutions.
Unlike the recently proposed Transformer model (e.g., ViT) that is specially designed for image classification, we propose the Pyramid Vision Transformer (PVT).
PVT can be trained on dense partitions of the image to achieve high output resolution, which is important for dense prediction.
arXiv Detail & Related papers (2021-02-24T08:33:55Z)