A Strong Transfer Baseline for RGB-D Fusion in Vision Transformers
- URL: http://arxiv.org/abs/2210.00843v1
- Date: Mon, 3 Oct 2022 12:08:09 GMT
- Title: A Strong Transfer Baseline for RGB-D Fusion in Vision Transformers
- Authors: Georgios Tziafas, Hamidreza Kasaei
- Abstract summary: We propose a recipe for transferring pretrained ViTs in RGB-D domains for single-view 3D object recognition.
We show that our adapted ViTs score up to 95.1% top-1 accuracy in Washington, achieving new state-of-the-art results in this benchmark.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Vision Transformer (ViT) architecture has recently established its place
in the computer vision literature, with multiple architectures for recognition
of image data or other visual modalities. However, training ViTs for RGB-D
object recognition remains an understudied topic, viewed in recent literature
only through the lens of multi-task pretraining in multiple modalities. Such
approaches are often computationally intensive and have not yet been applied
for challenging object-level classification tasks. In this work, we propose a
simple yet strong recipe for transferring pretrained ViTs in RGB-D domains for
single-view 3D object recognition, focusing on fusing RGB and depth
representations encoded jointly by the ViT. Compared to previous works in
multimodal Transformers, the key challenge here is to use the attested
flexibility of ViTs to capture cross-modal interactions at the downstream and
not the pretraining stage. We explore which depth representation is better in
terms of resulting accuracy and compare two methods for injecting RGB-D fusion
within the ViT architecture (i.e., early vs. late fusion). Our results in the
Washington RGB-D Objects dataset demonstrate that in such RGB $\rightarrow$
RGB-D scenarios, late fusion techniques work better than the more popularly
employed early fusion. With our transfer baseline, adapted ViTs score up to
95.1\% top-1 accuracy in Washington, achieving new state-of-the-art results in
this benchmark. We additionally evaluate our approach with an open-ended
lifelong learning protocol, where we show that our adapted RGB-D encoder leads
to features that outperform unimodal encoders, even without explicit
fine-tuning. We further integrate our method with a robot framework and
demonstrate how it can serve as a perception utility in an interactive robot
learning scenario, both in simulation and with a real robot.
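Below is a minimal PyTorch sketch of the late-fusion idea described in the abstract, assuming the timm library for an ImageNet-pretrained ViT backbone and a depth map already rendered as a 3-channel image (e.g., a colorized or surface-normal encoding). It is illustrative only, not the authors' implementation: a shared ViT encodes RGB and depth separately, and a linear head classifies the concatenated embeddings.

```python
# Late-fusion sketch (illustrative, not the paper's code): a shared
# ImageNet-pretrained ViT encodes RGB and a 3-channel depth image
# separately; a linear head classifies the concatenated features.
import torch
import torch.nn as nn
import timm  # assumed dependency for pretrained ViT backbones


class LateFusionViT(nn.Module):
    def __init__(self, num_classes: int, backbone: str = "vit_base_patch16_224"):
        super().__init__()
        # num_classes=0 makes timm return pooled [CLS] features instead of logits.
        self.encoder = timm.create_model(backbone, pretrained=True, num_classes=0)
        dim = self.encoder.num_features
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, rgb: torch.Tensor, depth3: torch.Tensor) -> torch.Tensor:
        # Both modalities pass through the same ViT encoder (late fusion).
        z_rgb = self.encoder(rgb)       # (B, dim)
        z_depth = self.encoder(depth3)  # (B, dim)
        return self.head(torch.cat([z_rgb, z_depth], dim=-1))


# Example usage with dummy 224x224 inputs and 51 Washington object classes.
model = LateFusionViT(num_classes=51)
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 51])
```

For early fusion, the two modalities would instead be merged before the transformer blocks (e.g., by summing or concatenating their patch embeddings); the paper reports that in this RGB $\rightarrow$ RGB-D transfer setting, late fusion performs better.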
Related papers
- Rethinking RGB-D Fusion for Semantic Segmentation in Surgical Datasets [5.069884983892437]
We propose a simple yet effective multi-modal (RGB and depth) training framework called SurgDepth.
We show state-of-the-art (SOTA) results on all publicly available datasets applicable for this task.
We conduct extensive experiments on benchmark datasets including EndoVis2022, AutoLapro, LapI2I and EndoVis 2017.
arXiv Detail & Related papers (2024-07-29T05:35:51Z) - Efficient Multi-Task Scene Analysis with RGB-D Transformers [7.9011213682805215]
We introduce an efficient multi-task scene analysis approach, called EMSAFormer, that uses an RGB-D Transformer-based encoder to perform multiple scene analysis tasks simultaneously.
Our approach achieves state-of-the-art performance while still enabling inference with up to 39.1 FPS on an NVIDIA Jetson AGX Orin 32 GB.
arXiv Detail & Related papers (2023-06-08T14:41:56Z) - CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets [50.6643933702394]
We present a single-model self-supervised hybrid pre-training framework for RGB and depth modalities, termed as CoMAE.
Our CoMAE presents a curriculum learning strategy to unify the two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling.
arXiv Detail & Related papers (2023-02-13T07:09:45Z) - RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in
Autonomous Driving [80.14669385741202]
Vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks.
ViTs are notoriously hard to train and require a lot of training data to learn powerful representations.
We show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and Semantic KITTI.
arXiv Detail & Related papers (2023-01-24T18:50:48Z) - Unified Object Detector for Different Modalities based on Vision
Transformers [1.14219428942199]
We develop a unified detector that achieves superior performance across diverse modalities.
Our research envisions an application scenario for robotics, where the unified system seamlessly switches between RGB cameras and depth sensors.
We evaluate our unified model on the SUN RGB-D dataset, and demonstrate that it achieves similar or better performance in terms of mAP50.
arXiv Detail & Related papers (2022-07-03T16:01:04Z) - Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient
Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z) - Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z) - Self-Supervised Representation Learning for RGB-D Salient Object
Detection [93.17479956795862]
We use Self-Supervised Representation Learning to design two pretext tasks: the cross-modal auto-encoder and the depth-contour estimation.
Our pretext tasks require only a few unlabeled RGB-D datasets to perform pre-training, which makes the network capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
arXiv Detail & Related papers (2021-01-29T09:16:06Z) - Bi-directional Cross-Modality Feature Propagation with
Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGB-D images, providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient cross-modality guided encoder that not only effectively recalibrates RGB feature responses, but also distills accurate depth information via multiple stages and aggregates the two recalibrated representations alternately.
arXiv Detail & Related papers (2020-07-17T18:35:24Z)