A multi-scale vision transformer-based multimodal GeoAI model for mapping Arctic permafrost thaw
- URL: http://arxiv.org/abs/2504.17822v1
- Date: Wed, 23 Apr 2025 22:18:10 GMT
- Title: A multi-scale vision transformer-based multimodal GeoAI model for mapping Arctic permafrost thaw
- Authors: Wenwen Li, Chia-Yu Hsu, Sizhe Wang, Zhining Gu, Yili Yang, Brendan M. Rogers, Anna Liljedahl,
- Abstract summary: Retrogressive Thaw Slumps (RTS) in Arctic regions are distinct permafrost landforms with significant environmental impacts. This paper employed a state-of-the-art deep learning model, the Cascade Mask R-CNN with a multi-scale vision transformer-based backbone, to delineate RTS features across the Arctic. Two new strategies were introduced to optimize multimodal learning and enhance the model's predictive performance.
- Score: 2.906027992527643
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrogressive Thaw Slumps (RTS) in Arctic regions are distinct permafrost landforms with significant environmental impacts. Mapping these RTS is crucial because their appearance serves as a clear indication of permafrost thaw. However, their small scale compared to other landform features, vague boundaries, and spatiotemporal variation pose significant challenges for accurate detection. In this paper, we employed a state-of-the-art deep learning model, the Cascade Mask R-CNN with a multi-scale vision transformer-based backbone, to delineate RTS features across the Arctic. Two new strategies were introduced to optimize multimodal learning and enhance the model's predictive performance: (1) a feature-level, residual cross-modality attention fusion strategy, which effectively integrates feature maps from multiple modalities to capture complementary information and improve the model's ability to understand complex patterns and relationships within the data; (2) pre-trained unimodal learning followed by multimodal fine-tuning to alleviate high computing demand while achieving strong model performance. Experimental results demonstrated that our approach outperformed existing models adopting data-level fusion, feature-level convolutional fusion, and various attention fusion strategies, providing valuable insights into the efficient utilization of multimodal data for RTS mapping. This research contributes to our understanding of permafrost landforms and their environmental implications.
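The first strategy in the abstract, feature-level residual cross-modality attention fusion, lends itself to a short illustration. The sketch below is a minimal, assumed PyTorch implementation: the module name `ResidualCrossModalFusion`, the two-modality setup, and all tensor shapes are illustrative and are not the authors' released code. It only shows the general idea of letting one modality's feature map attend to another's and adding the attended context back as a residual.

```python
# Minimal sketch of feature-level, residual cross-modality attention fusion.
# Assumption: names, shapes, and the two-modality setup are illustrative,
# not the paper's exact implementation.
import torch
import torch.nn as nn


class ResidualCrossModalFusion(nn.Module):
    """Fuses feature maps from two modalities with cross-attention,
    adding the attended context back to the primary modality's features
    as a residual so unimodal information is preserved."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm_a = nn.LayerNorm(channels)
        self.norm_b = nn.LayerNorm(channels)
        # Queries come from modality A; keys and values from modality B.
        self.cross_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.proj = nn.Linear(channels, channels)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a, feat_b: (B, C, H, W) feature maps from modality-specific backbones.
        b, c, h, w = feat_a.shape
        tokens_a = feat_a.flatten(2).transpose(1, 2)  # (B, H*W, C)
        tokens_b = feat_b.flatten(2).transpose(1, 2)
        attended, _ = self.cross_attn(
            self.norm_a(tokens_a), self.norm_b(tokens_b), self.norm_b(tokens_b)
        )
        # Residual connection: cross-modal context is added on top of the
        # primary modality's features rather than replacing them.
        fused = tokens_a + self.proj(attended)
        return fused.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    fusion = ResidualCrossModalFusion(channels=256)
    optical = torch.randn(2, 256, 32, 32)    # e.g., multispectral features
    elevation = torch.randn(2, 256, 32, 32)  # e.g., DEM-derived features
    print(fusion(optical, elevation).shape)  # torch.Size([2, 256, 32, 32])
```

Under the abstract's second strategy, each modality-specific backbone could first be pretrained on its own data, with a fusion block like this trained only during a shorter multimodal fine-tuning phase; this staging is how the paper motivates the reduced computing demand.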
Related papers
- PolSAM: Polarimetric Scattering Mechanism Informed Segment Anything Model [76.95536611263356]
PolSAR data presents unique challenges due to its rich and complex characteristics.
Existing data representations, such as complex-valued data, polarimetric features, and amplitude images, are widely used.
Most feature extraction networks for PolSAR are small, limiting their ability to capture features effectively.
We propose the Polarimetric Scattering Mechanism-Informed SAM (PolSAM), an enhanced Segment Anything Model (SAM) that integrates domain-specific scattering characteristics and a novel prompt generation strategy.
arXiv Detail & Related papers (2024-12-17T09:59:53Z)
- T-GMSI: A transformer-based generative model for spatial interpolation under sparse measurements [1.0931557410591526]
We propose a Transformer-based Generative Model for Spatial Interpolation (T-GMSI) using a vision transformer (ViT) architecture for digital elevation model (DEM) generation under sparse conditions.
T-GMSI replaces traditional convolution-based methods with ViT for feature extraction and DEM generation, while incorporating a feature-aware loss function for enhanced accuracy.
T-GMSI excels in producing high-quality elevation surfaces from datasets with over 70% sparsity and demonstrates strong transferability across diverse landscapes without fine-tuning.
arXiv Detail & Related papers (2024-12-13T06:01:39Z)
- HiTSR: A Hierarchical Transformer for Reference-based Super-Resolution [6.546896650921257]
We propose HiTSR, a hierarchical transformer model for reference-based image super-resolution.
We streamline the architecture and training pipeline by incorporating the double attention block from GAN literature.
Our model demonstrates superior performance across three datasets including SUN80, Urban100, and Manga109.
arXiv Detail & Related papers (2024-08-30T01:16:29Z)
- Federated Multi-Agent Mapping for Planetary Exploration [0.4143603294943439]
We propose a federated multi-agent mapping approach that jointly trains a global map model across agents without transmitting raw data.
Our method leverages implicit neural mapping to generate parsimonious, adaptable representations, reducing data transmission by up to 93.8% compared to raw maps.
We demonstrate the efficacy of our approach on Martian terrains and glacier datasets, achieving downstream path planning F1 scores as high as 0.95 while outperforming on map reconstruction losses.
arXiv Detail & Related papers (2024-04-02T20:32:32Z)
- Segment Anything Model Can Not Segment Anything: Assessing AI Foundation Model's Generalizability in Permafrost Mapping [19.307294875969827]
This paper introduces AI foundation models and their defining characteristics.
We evaluate the performance of large AI vision models, especially Meta's Segment Anything Model (SAM).
The results show that although promising, SAM still has room for improvement to support AI-augmented terrain mapping.
arXiv Detail & Related papers (2024-01-16T19:10:09Z)
- SIRAN: Sinkhorn Distance Regularized Adversarial Network for DEM Super-resolution using Discriminative Spatial Self-attention [5.178465447325005]
Digital Elevation Model (DEM) is an essential aspect in the remote sensing domain to analyze and explore different applications related to surface elevation information.
In this study, we intend to address the generation of high-resolution DEMs using high-resolution multi-spectral (MX) satellite imagery.
We present an objective function related to optimizing Sinkhorn distance with traditional GAN to improve the stability of adversarial learning.
arXiv Detail & Related papers (2023-11-27T12:03:22Z)
- Learning transformer-based heterogeneously salient graph representation for multimodal remote sensing image classification [42.15709954199397]
A transformer-based heterogeneously salient graph representation (THSGR) approach is proposed in this paper.
First, a multimodal heterogeneous graph encoder is presented to encode distinctively non-Euclidean structural features from heterogeneous data.
A self-attention-free multi-convolutional modulator is designed for effective and efficient long-term dependency modeling.
arXiv Detail & Related papers (2023-11-17T04:06:20Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- Aerial Images Meet Crowdsourced Trajectories: A New Approach to Robust Road Extraction [110.61383502442598]
We introduce a novel neural network framework termed Cross-Modal Message Propagation Network (CMMPNet).
CMMPNet is composed of two deep Auto-Encoders for modality-specific representation learning and a tailor-designed Dual Enhancement Module for cross-modal representation refinement.
Experiments on three real-world benchmarks demonstrate the effectiveness of our CMMPNet for robust road extraction.
arXiv Detail & Related papers (2021-11-30T04:30:10Z)
- Light Field Reconstruction via Deep Adaptive Fusion of Hybrid Lenses [67.01164492518481]
This paper explores the problem of reconstructing high-resolution light field (LF) images from hybrid lenses.
We propose a novel end-to-end learning-based approach, which can comprehensively utilize the specific characteristics of the input.
Our framework could potentially decrease the cost of high-resolution LF data acquisition and benefit LF data storage and transmission.
arXiv Detail & Related papers (2021-02-14T06:44:47Z)
- Revealing the Invisible with Model and Data Shrinking for Composite-database Micro-expression Recognition [49.463864096615254]
We analyze the influence of learning complexity, including the input complexity and model complexity.
We propose a recurrent convolutional network (RCN) to explore the shallower-architecture and lower-resolution input data.
We develop three parameter-free modules to integrate with RCN without increasing any learnable parameters.
arXiv Detail & Related papers (2020-06-17T06:19:24Z)
- Learning Deformable Image Registration from Optimization: Perspective, Modules, Bilevel Training and Beyond [62.730497582218284]
We develop a new deep learning based framework to optimize a diffeomorphic model via multi-scale propagation.
We conduct two groups of image registration experiments on 3D volume datasets including image-to-atlas registration on brain MRI data and image-to-image registration on liver CT data.
arXiv Detail & Related papers (2020-04-30T03:23:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.