simCrossTrans: A Simple Cross-Modality Transfer Learning for Object
Detection with ConvNets or Vision Transformers
- URL: http://arxiv.org/abs/2203.10456v1
- Date: Sun, 20 Mar 2022 05:03:29 GMT
- Title: simCrossTrans: A Simple Cross-Modality Transfer Learning for Object
Detection with ConvNets or Vision Transformers
- Authors: Xiaoke Shen, Ioannis Stamos
- Abstract summary: We study CMTL from 2D to 3D sensors to explore the upper-bound performance of 3D-sensor-only systems.
While most CMTL pipelines from 2D to 3D vision are complicated and based on Convolutional Neural Networks (ConvNets), ours is easy to implement and extend, and is based on both ConvNets and Vision Transformers (ViTs).
- Score: 1.14219428942199
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transfer learning is widely used in computer vision (CV) and natural
language processing (NLP), and has achieved great success. Most transfer
learning systems are based on the same modality (e.g., RGB images in CV and
text in NLP); cross-modality transfer learning (CMTL) systems, by contrast, are
scarce. In this work, we study CMTL from 2D to 3D sensors to explore the
upper-bound performance of 3D-sensor-only systems, which play a critical role
in robotic navigation and perform well in low-light scenarios. While most CMTL
pipelines from 2D to 3D vision are complicated and based on Convolutional
Neural Networks (ConvNets), ours is easy to implement and extend, and is based
on both ConvNets and Vision Transformers (ViTs): 1) By converting point clouds
to pseudo-images, we can use an almost identical network from pre-trained
models based on 2D images, which makes our system easy to implement and extend.
2) ViTs have recently shown good performance and robustness to occlusions, one
of the key reasons for the poor performance of 3D vision systems; we therefore
explore both a ViT and a ConvNet of similar model size to investigate the
performance difference. We name our approach simCrossTrans: simple
cross-modality transfer learning with ConvNets or ViTs. Experiments on the SUN
RGB-D dataset show that simCrossTrans achieves $13.2\%$ and $16.1\%$ absolute
performance gains with ConvNets and ViTs, respectively. We also observe that
the ViT-based model performs $9.7\%$ better than the ConvNet-based one, showing
the power of simCrossTrans with ViTs. simCrossTrans with ViTs surpasses the
previous state-of-the-art (SOTA) by a large margin of $+15.4\%$ mAP50. Compared
with the previous 2D detection SOTA based on RGB images, our depth-image-only
system has only a $1\%$ gap. The code, training/inference logs and models are
publicly available at
https://github.com/liketheflower/simCrossTrans
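
The core recipe, rendering depth/point-cloud data as image-like pseudo-images so that an RGB-pretrained 2D detector can be reused unchanged and simply fine-tuned, can be sketched as follows. This is a minimal illustration, not the authors' code: the torchvision Faster R-CNN detector and the normalize-and-replicate depth encoding below are assumptions made for the sketch, and the paper's actual detector and pseudo-image rendering may differ (see the repository above).

```python
import numpy as np
import torch
import torchvision


def depth_to_pseudo_image(depth: np.ndarray) -> torch.Tensor:
    """Normalize a single-channel depth map to [0, 1] and replicate it to
    3 channels so it matches the input layout of RGB-pretrained detectors.
    (One possible encoding; the paper's exact rendering may differ.)"""
    d = depth.astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-6)  # scale to [0, 1]
    return torch.from_numpy(np.repeat(d[None, :, :], 3, axis=0))  # (3, H, W)


# Start from a detector pre-trained on RGB images (COCO weights here).
# Because the pseudo-image has the same shape as an RGB image, the 2D
# architecture is reused unchanged and only needs fine-tuning.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.train()

# Dummy fine-tuning step on a placeholder depth frame (real training would
# iterate over SUN RGB-D depth images and their box annotations).
depth_map = np.random.rand(480, 640)
images = [depth_to_pseudo_image(depth_map)]
targets = [{"boxes": torch.tensor([[50.0, 60.0, 200.0, 220.0]]),
            "labels": torch.tensor([1])}]
loss_dict = model(images, targets)  # dict of detection losses
loss = sum(loss_dict.values())
loss.backward()
```

The same idea carries over to a transformer backbone: in principle one would swap the ConvNet detector for a ViT-based one of comparable size and fine-tune it on the same pseudo-images, which is the ConvNet-vs-ViT comparison the abstract describes.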
Related papers
- RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving (2023-01-24) [80.14669385741202]
  Vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks.
  ViTs are notoriously hard to train and require a lot of training data to learn powerful representations.
  We show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI.
- What do Vision Transformers Learn? A Visual Exploration (2022-12-13) [68.50771218442776]
  Vision transformers (ViTs) are quickly becoming the de facto architecture for computer vision.
  This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
  We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twins.
- A Strong Transfer Baseline for RGB-D Fusion in Vision Transformers (2022-10-03) [0.0]
  We propose a recipe for transferring pretrained ViTs in RGB-D domains for single-view 3D object recognition.
  We show that our adapted ViTs score up to 95.1% top-1 accuracy on the Washington benchmark, achieving new state-of-the-art results.
- Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer? (2022-09-15) [111.11502241431286]
  Vision Transformers (ViTs) have proven to be effective in solving 2D image understanding tasks.
  ViTs for 2D and 3D tasks have so far adopted vastly different architecture designs that are hardly transferable.
  This paper demonstrates the appealing promise of understanding the 3D visual world using a standard 2D ViT architecture.
- VidConv: A modernized 2D ConvNet for Efficient Video Recognition (2022-07-08) [0.8070014188337304]
  Vision Transformers (ViTs) have been steadily breaking records on many vision tasks.
  However, ViTs are generally computation- and memory-intensive and unfriendly to embedded devices.
  In this paper, we adopt the modernized structure of ConvNets to design a new backbone for action recognition.
- Auto-scaling Vision Transformers without Training (2022-02-24) [84.34662535276898]
  We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
  As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
  As a unified framework, As-ViT achieves strong performance on classification and detection.
- Self-slimmed Vision Transformer (2021-11-24) [52.67243496139175]
  Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
  We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
  Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
- MFEViT: A Robust Lightweight Transformer-based Network for Multimodal 2D+3D Facial Expression Recognition (2021-09-20) [1.7448845398590227]
  The Vision Transformer (ViT) has been widely applied in many areas due to its self-attention mechanism.
  We propose a robust lightweight pure transformer-based network for multimodal 2D+3D FER, namely MFEViT.
  Our MFEViT outperforms state-of-the-art approaches with an accuracy of 90.83% on BU-3DFE and 90.28% on Bosphorus.
- ConvNets vs. Transformers: Whose Visual Representations are More Transferable? (2021-08-11) [49.62201738334348]
  We investigate the transfer learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations.
  We observe consistent advantages of Transformer-based backbones on 13 downstream tasks.