HiRes-FusedMIM: A High-Resolution RGB-DSM Pre-trained Model for Building-Level Remote Sensing Applications
- URL: http://arxiv.org/abs/2503.18540v1
- Date: Mon, 24 Mar 2025 10:49:55 GMT
- Title: HiRes-FusedMIM: A High-Resolution RGB-DSM Pre-trained Model for Building-Level Remote Sensing Applications
- Authors: Guneet Mutreja, Philipp Schuegraf, Ksenia Bittner,
- Abstract summary: HiRes-FusedMIM is a novel pre-trained model specifically designed to leverage the rich information contained within high-resolution RGB and DSM data.<n>We conducted a comprehensive evaluation of HiRes-FusedMIM on a diverse set of downstream tasks, including classification, semantic segmentation, and instance segmentation.
- Score: 2.048226951354646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in self-supervised learning have led to the development of foundation models that have significantly advanced performance in various computer vision tasks. However, despite their potential, these models often overlook the crucial role of high-resolution digital surface models (DSMs) in understanding urban environments, particularly for building-level analysis, which is essential for applications like digital twins. To address this gap, we introduce HiRes-FusedMIM, a novel pre-trained model specifically designed to leverage the rich information contained within high-resolution RGB and DSM data. HiRes-FusedMIM utilizes a dual-encoder simple masked image modeling (SimMIM) architecture with a multi-objective loss function that combines reconstruction and contrastive objectives, enabling it to learn powerful, joint representations from both modalities. We conducted a comprehensive evaluation of HiRes-FusedMIM on a diverse set of downstream tasks, including classification, semantic segmentation, and instance segmentation. Our results demonstrate that: 1) HiRes-FusedMIM outperforms previous state-of-the-art geospatial methods on several building-related datasets, including WHU Aerial and LoveDA, demonstrating its effectiveness in capturing and leveraging fine-grained building information; 2) Incorporating DSMs during pre-training consistently improves performance compared to using RGB data alone, highlighting the value of elevation information for building-level analysis; 3) The dual-encoder architecture of HiRes-FusedMIM, with separate encoders for RGB and DSM data, significantly outperforms a single-encoder model on the Vaihingen segmentation task, indicating the benefits of learning specialized representations for each modality. To facilitate further research and applications in this direction, we will publicly release the trained model weights.
Related papers
- MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism [67.56918651825056]
We propose a new decoder architecture with the parallel Multi-time Inquiries (MI) mechanism.<n>Our MI based model, MI-DETR, outperforms all existing DETR-like models on COCO benchmark.<n>A series of diagnostic and visualization experiments demonstrate the effectiveness, rationality, and interpretability of MI.
arXiv Detail & Related papers (2025-03-03T12:19:06Z) - ArtGS: Building Interactable Replicas of Complex Articulated Objects via Gaussian Splatting [66.29782808719301]
Building articulated objects is a key challenge in computer vision.<n>Existing methods often fail to effectively integrate information across different object states.<n>We introduce ArtGS, a novel approach that leverages 3D Gaussians as a flexible and efficient representation.
arXiv Detail & Related papers (2025-02-26T10:25:32Z) - SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation [81.36747103102459]
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications.<n>Current state-of-the-art methods focus on training innovative architectural designs on confined datasets.<n>We investigate the impact of scaling up EHPS towards a family of generalist foundation models.
arXiv Detail & Related papers (2025-01-16T18:59:46Z) - YOLO-RD: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary [12.39040757106137]
We introduce an innovative Retriever-Dictionary (RD) module to address this issue.<n>This architecture enables YOLO-based models to efficiently retrieve features from a Dictionary that contains the insight of the dataset.<n>Experiments show that using the RD significantly improves model performance, achieving more than a 3% increase in mean Average Precision for object detection.
arXiv Detail & Related papers (2024-10-20T09:38:58Z) - ATOMMIC: An Advanced Toolbox for Multitask Medical Imaging Consistency to facilitate Artificial Intelligence applications from acquisition to analysis in Magnetic Resonance Imaging [0.10434396204054465]
ATOMMIC is an open-source toolbox that streamlines AI applications for accelerated MRI reconstruction and analysis.
ATOMMIC implements several tasks using DL networks and enables MultiTask Learning (MTL) to perform related tasks integrated, targeting generalization in the MRI domain.
arXiv Detail & Related papers (2024-04-30T16:00:21Z) - State Space Models as Foundation Models: A Control Theoretic Overview [3.3222241150972356]
In recent years, there has been a growing interest in integrating linear state-space models (SSM) in deep neural network architectures.
This paper is intended as a gentle introduction to SSM-based architectures for control theorists.
It provides a systematic review of the most successful SSM proposals and highlights their main features from a control theoretic perspective.
arXiv Detail & Related papers (2024-03-25T16:10:47Z) - Robustness Analysis on Foundational Segmentation Models [28.01242494123917]
In this work, we perform a robustness analysis of Visual Foundation Models (VFMs) for segmentation tasks.
We benchmark seven state-of-the-art segmentation architectures using 2 different datasets.
Our findings reveal several key insights: VFMs exhibit vulnerabilities to compression-induced corruptions, despite not outpacing all of unimodal models in robustness, multimodal models show competitive resilience in zero-shot scenarios, and VFMs demonstrate enhanced robustness for certain object categories.
arXiv Detail & Related papers (2023-06-15T16:59:42Z) - GAMUS: A Geometry-aware Multi-modal Semantic Segmentation Benchmark for
Remote Sensing Data [27.63411386396492]
This paper introduces a new benchmark dataset for multi-modal semantic segmentation based on RGB-Height (RGB-H) data.
The proposed benchmark consists of 1) a large-scale dataset including co-registered RGB and nDSM pairs and pixel-wise semantic labels; 2) a comprehensive evaluation and analysis of existing multi-modal fusion strategies for both convolutional and Transformer-based networks on remote sensing data.
arXiv Detail & Related papers (2023-05-24T09:03:18Z) - Towards Multimodal Multitask Scene Understanding Models for Indoor
Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges.
MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par or even better than single-task models.
arXiv Detail & Related papers (2022-09-27T04:49:19Z) - Unsupervised Domain Adaptive Learning via Synthetic Data for Person
Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z) - Siamese Network for RGB-D Salient Object Detection and Beyond [113.30063105890041]
A novel framework is proposed to learn from both RGB and depth inputs through a shared network backbone.
Comprehensive experiments using five popular metrics show that the designed framework yields a robust RGB-D saliency detector.
We also link JL-DCF to the RGB-D semantic segmentation field, showing its capability of outperforming several semantic segmentation models.
arXiv Detail & Related papers (2020-08-26T06:01:05Z) - Revealing the Invisible with Model and Data Shrinking for
Composite-database Micro-expression Recognition [49.463864096615254]
We analyze the influence of learning complexity, including the input complexity and model complexity.
We propose a recurrent convolutional network (RCN) to explore the shallower-architecture and lower-resolution input data.
We develop three parameter-free modules to integrate with RCN without increasing any learnable parameters.
arXiv Detail & Related papers (2020-06-17T06:19:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.