Efficient Multi-Task Scene Analysis with RGB-D Transformers
- URL: http://arxiv.org/abs/2306.05242v1
- Date: Thu, 8 Jun 2023 14:41:56 GMT
- Title: Efficient Multi-Task Scene Analysis with RGB-D Transformers
- Authors: Söhnke Benedikt Fischedick, Daniel Seichter, Robin Schmidt, Leonard Rabes, and Horst-Michael Gross
- Abstract summary: We introduce an efficient multi-task scene analysis approach, called EMSAFormer, that uses an RGB-D Transformer-based encoder to simultaneously perform the aforementioned tasks.
Our approach achieves state-of-the-art performance while still enabling inference with up to 39.1 FPS on an NVIDIA Jetson AGX Orin 32 GB.
- Score: 7.9011213682805215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene analysis is essential for enabling autonomous systems, such as mobile
robots, to operate in real-world environments. However, obtaining a
comprehensive understanding of the scene requires solving multiple tasks, such
as panoptic segmentation, instance orientation estimation, and scene
classification. Solving these tasks given limited computing and battery
capabilities on mobile platforms is challenging. To address this challenge, we
introduce an efficient multi-task scene analysis approach, called EMSAFormer,
that uses an RGB-D Transformer-based encoder to simultaneously perform the
aforementioned tasks. Our approach builds upon the previously published
EMSANet. However, we show that the dual CNN-based encoder of EMSANet can be
replaced with a single Transformer-based encoder. To achieve this, we
investigate how information from both RGB and depth data can be effectively
incorporated in a single encoder. To accelerate inference on robotic hardware,
we provide a custom NVIDIA TensorRT extension enabling highly optimized inference for
our EMSAFormer approach. Through extensive experiments on the commonly used
indoor datasets NYUv2, SUNRGB-D, and ScanNet, we show that our approach
achieves state-of-the-art performance while still enabling inference with up to
39.1 FPS on an NVIDIA Jetson AGX Orin 32 GB.
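As a hedged illustration of the pattern the abstract describes (a single Transformer-based encoder over fused RGB-D input feeding several task-specific heads), the sketch below shows one way such a model can be wired up in PyTorch. It is not the released EMSAFormer code: the early channel-concatenation fusion, layer sizes, and head definitions are illustrative assumptions.
```python
# Minimal sketch (not the authors' code): one shared Transformer encoder over
# fused RGB-D tokens feeding several task-specific heads. Fusion scheme and all
# sizes are illustrative assumptions.
import torch
import torch.nn as nn


class SharedRGBDEncoder(nn.Module):
    def __init__(self, dim=256, depth=4, patch=16):
        super().__init__()
        # Fuse RGB (3 channels) and depth (1 channel) by early concatenation
        # into a single patch embedding (an assumption for this sketch).
        self.patch_embed = nn.Conv2d(4, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)                        # (B, 4, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        return self.encoder(tokens)


class MultiTaskHeads(nn.Module):
    def __init__(self, dim=256, num_sem_classes=40, num_scene_classes=10):
        super().__init__()
        self.semantic = nn.Linear(dim, num_sem_classes)   # per-token semantic logits
        self.orientation = nn.Linear(dim, 2)              # sin/cos orientation per token
        self.scene = nn.Linear(dim, num_scene_classes)    # pooled scene classification

    def forward(self, feats):
        return {
            "semantic": self.semantic(feats),
            "orientation": self.orientation(feats),
            "scene": self.scene(feats.mean(dim=1)),
        }


if __name__ == "__main__":
    encoder, heads = SharedRGBDEncoder(), MultiTaskHeads()
    rgb = torch.randn(1, 3, 224, 224)
    depth = torch.randn(1, 1, 224, 224)
    out = heads(encoder(rgb, depth))
    print({k: v.shape for k, v in out.items()})
```
For deployment on an embedded device such as a Jetson AGX Orin, a model like this would typically be exported (for example via ONNX) and then optimized with TensorRT, which is the kind of inference acceleration the authors' custom TensorRT extension targets.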
Related papers
- Are We Ready for Real-Time LiDAR Semantic Segmentation in Autonomous Driving? [42.348499880894686]
Scene semantic segmentation can be achieved by directly integrating 3D spatial data with specialized deep neural networks.
We investigate various 3D semantic segmentation methodologies and analyze their performance and capabilities for resource-constrained inference on embedded NVIDIA Jetson platforms.
arXiv Detail & Related papers (2024-10-10T20:47:33Z)
- Efficient Transformer Encoders for Mask2Former-style models [57.54752243522298]
ECO-M2F is a strategy to self-select the number of hidden layers in the encoder conditioned on the input image.
The proposed approach reduces expected encoder computational cost while maintaining performance.
It is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
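As a rough, hedged illustration of input-conditioned encoder depth (not the ECO-M2F algorithm itself), the sketch below runs only as many Transformer layers as a small gate predicts for the given input; the gate design, pooling, and layer counts are assumptions.
```python
# Illustrative input-conditioned encoder depth: a gate picks how many layers to
# run, so easy inputs exit early. Not the paper's implementation.
import torch
import torch.nn as nn


class InputConditionedDepthEncoder(nn.Module):
    """Runs only as many Transformer layers as a small gate predicts per input."""

    def __init__(self, dim=256, max_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(max_layers)
        )
        # Gate maps a pooled token summary to scores over "how many layers to use".
        self.gate = nn.Linear(dim, max_layers)

    def forward(self, tokens):  # tokens: (B, N, dim)
        n_layers = int(self.gate(tokens.mean(dim=1)).argmax(dim=-1).max()) + 1
        for layer in self.layers[:n_layers]:
            tokens = layer(tokens)
        return tokens, n_layers


if __name__ == "__main__":
    encoder = InputConditionedDepthEncoder()
    feats, used = encoder(torch.randn(2, 196, 256))
    print(feats.shape, "layers used:", used)
```
Training such a gate end-to-end would require a differentiable relaxation of the discrete layer choice (for example a Gumbel-Softmax-style estimator); the sketch only illustrates inference-time behaviour.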
arXiv Detail & Related papers (2024-04-23T17:26:34Z)
- Multi-objective Differentiable Neural Architecture Search [58.67218773054753]
We propose a novel NAS algorithm that encodes user preferences for the trade-off between performance and hardware metrics.
Our method outperforms existing MOO NAS methods across a broad range of qualitatively different search spaces and datasets.
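As a toy illustration only, user preferences over accuracy and hardware cost can be folded into a single scalar used to rank candidate architectures; this weighted-sum scalarization and the made-up candidate records are not the paper's algorithm.
```python
# Toy preference-weighted scalarization for ranking architecture candidates.
# Candidate records and weights are illustrative, not from the paper.
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float    # higher is better
    latency_ms: float  # lower is better


def score(c: Candidate, pref_accuracy: float = 0.7) -> float:
    """Scalarize two objectives with a user preference weight in [0, 1]."""
    pref_latency = 1.0 - pref_accuracy
    return pref_accuracy * c.accuracy - pref_latency * (c.latency_ms / 100.0)


if __name__ == "__main__":
    pool = [
        Candidate("small", 0.74, 8.0),
        Candidate("base", 0.79, 21.0),
        Candidate("large", 0.81, 55.0),
    ]
    best = max(pool, key=lambda c: score(c, pref_accuracy=0.6))
    print("preferred architecture:", best.name)
```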
arXiv Detail & Related papers (2024-02-28T10:09:04Z)
- EventTransAct: A video transformer-based framework for Event-camera based action recognition [52.537021302246664]
Event cameras offer new opportunities for action recognition compared to standard RGB videos.
In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame.
In order to better adapt the VTN to the sparse and fine-grained nature of event data, we design an Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations.
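The summary names an Event-Contrastive Loss $\mathcal{L}_{EC}$ but does not give its form; purely as an illustration of the contrastive pattern such losses commonly follow, the sketch below implements a standard InfoNCE-style loss over embeddings of two augmented views of the same event clips. It should not be read as the paper's exact loss.
```python
# Generic InfoNCE-style contrastive loss over clip embeddings; not the paper's
# exact L_EC formulation, only the usual contrastive pattern.
import torch
import torch.nn.functional as F


def event_contrastive_sketch(z_a, z_b, temperature=0.07):
    """z_a, z_b: (B, D) embeddings of two augmented views of the same event clips."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)    # positives on the diagonal
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
    print(event_contrastive_sketch(z1, z2).item())
```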
arXiv Detail & Related papers (2023-08-25T23:51:07Z)
- Dynamic Grained Encoder for Vision Transformers [150.02797954201424]
This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images.
We propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region.
Our encoder allows the state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2023-01-10T07:55:29Z)
- It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training [76.69480467101143]
Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline.
We explicitly investigate motion cues in videos as an extra prediction target and propose our Masked Appearance-Motion Modeling framework.
Our method learns generalized video representations and achieves 82.3% on Kinetics-400, 71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51.
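A minimal sketch of the mask-and-predict idea with an extra motion target: masked token positions are reconstructed both from the current frame's patch embeddings (appearance) and from the difference to the next frame (motion). The masking ratio, zero-masking, and equal loss weighting are assumptions, not the paper's recipe.
```python
# Minimal mask-and-predict sketch with appearance and motion targets; all
# shapes, ratios, and heads are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedAppearanceMotionSketch(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.appearance_head = nn.Linear(dim, dim)
        self.motion_head = nn.Linear(dim, dim)

    def forward(self, tokens_t, tokens_t1, mask_ratio=0.75):
        # tokens_t, tokens_t1: (B, N, dim) patch embeddings of consecutive frames.
        b, n, _ = tokens_t.shape
        mask = torch.rand(b, n, device=tokens_t.device) < mask_ratio  # True = masked
        visible = tokens_t.masked_fill(mask.unsqueeze(-1), 0.0)       # zero out masked tokens
        feats = self.backbone(visible)
        loss_app = F.mse_loss(self.appearance_head(feats)[mask], tokens_t[mask])
        loss_mot = F.mse_loss(self.motion_head(feats)[mask], (tokens_t1 - tokens_t)[mask])
        return loss_app + loss_mot


if __name__ == "__main__":
    model = MaskedAppearanceMotionSketch()
    t, t1 = torch.randn(2, 49, 128), torch.randn(2, 49, 128)
    print(model(t, t1).item())
```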
arXiv Detail & Related papers (2022-10-11T08:05:18Z)
- Robust Double-Encoder Network for RGB-D Panoptic Segmentation [31.807572107839576]
Panoptic segmentation provides an interpretation of the scene by computing a pixelwise semantic label together with instance IDs.
We propose a novel encoder-decoder neural network that processes RGB and depth separately through two encoders.
We show that our approach achieves superior results compared to other common approaches for panoptic segmentation.
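To contrast with EMSAFormer's single shared encoder, the dual-encoder pattern described here can be sketched as two modality-specific encoders whose feature maps are fused before a shared prediction head; the tiny backbones, fusion by summation, and the 40-class output are assumptions made only for the example.
```python
# Sketch of the dual-encoder pattern (one encoder per modality, fused features,
# shared head). Backbones and fusion scheme are assumptions, not the paper's.
import torch
import torch.nn as nn


class DoubleEncoderSketch(nn.Module):
    """Two modality-specific encoders, element-wise feature fusion, shared head."""

    def __init__(self, dim=64, num_classes=40):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU())
        self.depth_enc = nn.Sequential(nn.Conv2d(1, dim, 3, stride=2, padding=1), nn.ReLU())
        self.head = nn.Conv2d(dim, num_classes, kernel_size=1)  # per-pixel semantic logits

    def forward(self, rgb, depth):
        fused = self.rgb_enc(rgb) + self.depth_enc(depth)  # summation fusion (an assumption)
        return self.head(fused)


if __name__ == "__main__":
    model = DoubleEncoderSketch()
    out = model(torch.randn(1, 3, 64, 64), torch.randn(1, 1, 64, 64))
    print(out.shape)  # (1, 40, 32, 32)
```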
arXiv Detail & Related papers (2022-10-06T11:46:37Z)
- A Strong Transfer Baseline for RGB-D Fusion in Vision Transformers [0.0]
We propose a recipe for transferring pretrained ViTs in RGB-D domains for single-view 3D object recognition.
We show that our adapted ViTs score up to 95.1% top-1 accuracy on the Washington RGB-D Objects dataset, achieving new state-of-the-art results on this benchmark.
arXiv Detail & Related papers (2022-10-03T12:08:09Z)
- Efficient Multi-Task RGB-D Scene Analysis for Indoor Environments [13.274695420192884]
We propose an efficient multi-task approach for RGB-D scene analysis (EMSANet).
We show that all tasks can be accomplished using a single neural network in real time on a mobile platform without diminishing performance.
We are the first to provide results in such a comprehensive multi-task setting for indoor scene analysis on NYUv2 and SUNRGB-D.
arXiv Detail & Related papers (2022-07-10T20:03:38Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis [16.5390740005143]
We propose an efficient and robust RGB-D segmentation approach that can be optimized to a high degree using NVIDIA TensorRT.
We show that RGB-D segmentation is superior to processing RGB images solely and that it can still be performed in real time if the network architecture is carefully designed.
arXiv Detail & Related papers (2020-11-13T15:17:31Z)