General-Purpose Multimodal Transformer meets Remote Sensing Semantic
Segmentation
- URL: http://arxiv.org/abs/2307.03388v1
- Date: Fri, 7 Jul 2023 04:58:34 GMT
- Title: General-Purpose Multimodal Transformer meets Remote Sensing Semantic
Segmentation
- Authors: Nhi Kieu, Kien Nguyen, Sridha Sridharan, Clinton Fookes
- Abstract summary: Multimodal AI seeks to exploit complementary data sources, particularly for complex tasks like semantic segmentation.
Recent trends in general-purpose multimodal networks have shown great potential to achieve state-of-the-art performance.
We propose a UNet-inspired module that employs 3D convolution to encode vital local information and learn cross-modal features simultaneously.
- Score: 35.100738362291416
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The advent of high-resolution multispectral/hyperspectral sensors, LiDAR DSM
(Digital Surface Model) information and many others has provided us with an
unprecedented wealth of data for Earth Observation. Multimodal AI seeks to
exploit those complementary data sources, particularly for complex tasks like
semantic segmentation. While specialized architectures have been developed,
they are highly complicated via significant effort in model design, and require
considerable re-engineering whenever a new modality emerges. Recent trends in
general-purpose multimodal networks have shown great potential to achieve
state-of-the-art performance across multiple multimodal tasks with one unified
architecture. In this work, we investigate the performance of PerceiverIO, one
in the general-purpose multimodal family, in the remote sensing semantic
segmentation domain. Our experiments reveal that this ostensibly universal
network struggles with object scale variation in remote sensing images and
fails to detect the presence of cars from a top-down view. To address these
issues, even with extreme class imbalance issues, we propose a spatial and
volumetric learning component. Specifically, we design a UNet-inspired module
that employs 3D convolution to encode vital local information and learn
cross-modal features simultaneously, while reducing network computational
burden via the cross-attention mechanism of PerceiverIO. The effectiveness of
the proposed component is validated through extensive experiments comparing it
with other methods such as 2D convolution and a dual local module (i.e., the
combination of Conv2D 1x1 and Conv2D 3x3 inspired by UNetFormer). The proposed
method achieves competitive results with specialized architectures like
UNetFormer and SwinUNet, showing its potential to minimize network architecture
engineering with a minimal compromise on the performance.
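The abstract's idea of using 3D convolution to learn local and cross-modal features together can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `conv3d_valid`, the toy optical and DSM grids, and the all-ones kernel are assumptions made for illustration only. The sketch stacks two co-registered modalities along a depth axis and slides one kernel over the modality and spatial axes at once, so each output value mixes spatial neighbours from both modalities.

```python
# Hypothetical sketch, not the paper's code: a naive "valid" 3D convolution
# (cross-correlation, as deep learning frameworks compute it) over a volume
# built by stacking modalities along the first axis.

def conv3d_valid(volume, kernel):
    """Slide kernel[km][kh][kw] over volume[M][H][W] without padding."""
    M, H, W = len(volume), len(volume[0]), len(volume[0][0])
    km, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for m in range(M - km + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # The sum runs over the modality axis (dm) and both spatial
                # axes (dy, dx), so one kernel sees all of them together.
                for dm in range(km):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[m + dm][y + dy][x + dx] * kernel[dm][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy inputs (assumed values): one optical band and one DSM height grid.
optical = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
dsm = [[10, 10, 10], [10, 20, 10], [10, 10, 10]]
volume = [optical, dsm]  # shape: 2 modalities x 3 rows x 3 columns

# A 2x3x3 all-ones kernel spans both modalities and the full 3x3 window.
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]

fused = conv3d_valid(volume, kernel)
print(fused)  # [[[145.0]]]
```

A 2D convolution applied to each modality separately would produce the two sums 45 and 100 independently; the 3D kernel combines them in a single response, which is the sense in which spatial and cross-modal information are encoded in one step here.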
Related papers
- MANet: Fine-Tuning Segment Anything Model for Multimodal Remote Sensing Semantic Segmentation [8.443065903814821]
This study introduces a novel Multimodal Adapter-based Network (MANet) for multimodal remote sensing semantic segmentation.
At the core of this approach is the development of a Multimodal Adapter (MMAdapter), which fine-tunes SAM's image encoder to effectively leverage the model's general knowledge for multimodal data.
This work not only introduces a novel network for multimodal fusion, but also demonstrates, for the first time, SAM's powerful generalization capabilities with Digital Surface Model (DSM) data.
arXiv Detail & Related papers (2024-10-15T00:52:16Z)
- PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection [59.355022416218624]
The integration of point and voxel representations is becoming more common in LiDAR-based 3D object detection.
We propose a novel two-stage 3D object detector, called Point-Voxel Attention Fusion Network (PVAFN).
PVAFN uses a multi-pooling strategy to integrate both multi-scale and region-specific information effectively.
arXiv Detail & Related papers (2024-08-26T19:43:01Z)
- A Multitask Deep Learning Model for Classification and Regression of Hyperspectral Images: Application to the large-scale dataset [44.94304541427113]
We propose a multitask deep learning model to perform multiple classification and regression tasks simultaneously on hyperspectral images.
We validated our approach on a large hyperspectral dataset called TAIGA.
A comprehensive qualitative and quantitative analysis of the results shows that the proposed method significantly outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-23T11:14:54Z)
- RS-DFM: A Remote Sensing Distributed Foundation Model for Diverse Downstream Tasks [11.681342476516267]
We propose a Remote Distributed Sensing Foundation Model (RS-DFM) based on generalized information mapping and interaction.
This model can realize online collaborative perception across multiple platforms and various downstream tasks.
We present a dual-branch information compression module to decouple high-frequency and low-frequency feature information.
arXiv Detail & Related papers (2024-06-11T07:46:47Z)
- Towards Unified 3D Object Detection via Algorithm and Data Unification [70.27631528933482]
We build the first unified multi-modal 3D object detection benchmark MM-Omni3D and extend the aforementioned monocular detector to its multi-modal version.
We name the designed monocular and multi-modal detectors as UniMODE and MM-UniMODE, respectively.
arXiv Detail & Related papers (2024-02-28T18:59:31Z)
- ESDMR-Net: A Lightweight Network With Expand-Squeeze and Dual Multiscale
Residual Connections for Medical Image Segmentation [7.921517156237902]
This paper presents an expand-squeeze dual multiscale residual network (ESDMR-Net).
It is a fully convolutional network that is well-suited for resource-constrained computing hardware such as mobile devices.
We present experiments on seven datasets from five distinct examples of applications.
arXiv Detail & Related papers (2023-12-17T02:15:49Z)
- Bilateral Network with Residual U-blocks and Dual-Guided Attention for
Real-time Semantic Segmentation [18.393208069320362]
We design a new fusion mechanism for two-branch architecture which is guided by attention computation.
To be precise, we use the Dual-Guided Attention (DGA) module we proposed to replace some multi-scale transformations.
Experiments on the Cityscapes and CamVid datasets show the effectiveness of our method.
arXiv Detail & Related papers (2023-10-31T09:20:59Z)
- Multi-task Learning with 3D-Aware Regularization [55.97507478913053]
We propose a structured 3D-aware regularizer which interfaces multiple tasks through the projection of features extracted from an image encoder to a shared 3D feature space.
We show that the proposed method is architecture agnostic and can be plugged into various prior multi-task backbones to improve their performance.
arXiv Detail & Related papers (2023-10-02T08:49:56Z)
- UniTR: A Unified and Efficient Multi-Modal Transformer for
Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for
Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
- Unpaired Multi-modal Segmentation via Knowledge Distillation [77.39798870702174]
We propose a novel learning scheme for unpaired cross-modality image segmentation.
In our method, we heavily reuse network parameters, by sharing all convolutional kernels across CT and MRI.
We have extensively validated our approach on two multi-class segmentation problems.
arXiv Detail & Related papers (2020-01-06T20:03:17Z)