Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data
- URL: http://arxiv.org/abs/2503.12843v3
- Date: Wed, 26 Mar 2025 16:15:55 GMT
- Title: Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data
- Authors: Haozhe Si, Yuxuan Wan, Minh Do, Deepak Vasisht, Han Zhao, Hendrik F. Hamann,
- Abstract summary: We introduce Low-rank Efficient Spatial-Spectral Vision Transformer with three key innovations.<n>We pretrain LESS ViT using a Hyperspectral Masked Autoencoder framework with integrated positional and channel masking strategies.<n> Experimental results demonstrate that our proposed method achieves competitive performance against state-of-the-art multi-modal geospatial foundation models.
- Score: 14.104497777255137
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Geospatial raster data, such as that collected by satellite-based imaging systems at different times and spectral bands, hold immense potential for enabling a wide range of high-impact applications. This potential stems from the rich information that is spatially and temporally contextualized across multiple channels and sensing modalities. Recent work has adapted existing self-supervised learning approaches for such geospatial data. However, they fall short of scalable model architectures, leading to inflexibility and computational inefficiencies when faced with an increasing number of channels and modalities. To address these limitations, we introduce Low-rank Efficient Spatial-Spectral Vision Transformer with three key innovations: i) the LESS Attention Block that approximates high-dimensional spatial-spectral attention through Kronecker's product of the low-dimensional spatial and spectral attention components; ii) the Continuous Positional-Channel Embedding Layer that preserves both the continuity and physical characteristics of each spatial-spectral patch; and iii) the Perception Field Mask that exploits local spatial dependencies by constraining attention to neighboring patches. To evaluate the proposed innovations, we construct GFM-Bench, which serves as a comprehensive benchmark for such geospatial raster data. We pretrain LESS ViT using a Hyperspectral Masked Autoencoder framework with integrated positional and channel masking strategies. Experimental results demonstrate that our proposed method achieves competitive performance against state-of-the-art multi-modal geospatial foundation models while outperforming them on cross-satellite generalization tasks with higher computational efficiency. The flexibility and extensibility of our framework make it a promising direction for future geospatial data analysis tasks that involve a wide range of modalities and channels.
Related papers
- Few-Shot Video Object Segmentation in X-Ray Angiography Using Local Matching and Spatio-Temporal Consistency Loss [13.850743997507488]
We introduce a novel FSVOS model that employs a local matching strategy to restrict the search space to the most neighboring pixels.<n>Specifically, we implement non-parametric sampling mechanism that enables dynamically varying sampling regions.<n>This work offers enhanced potential for a wide range of clinical applications.
arXiv Detail & Related papers (2026-01-02T21:26:28Z) - 3dSAGER: Geospatial Entity Resolution over 3D Objects (Technical Report) [7.378893412842889]
3dSAGER is an end-to-end pipeline for geospatial entity resolution over 3D objects.<n>We present a novel, spatial-reference-independent featurization mechanism that captures intricate geometric characteristics of matching pairs.<n>We also propose a new lightweight and interpretable blocking method, BKAFI, that leverages a trained model to efficiently generate high-recall candidate sets.
arXiv Detail & Related papers (2025-11-09T09:35:45Z) - Hyperspectral Adapter for Semantic Segmentation with Vision Foundation Models [18.24287471339871]
Hyperspectral imaging (HSI) captures spatial information along with dense spectral measurements across numerous narrow wavelength bands.<n>Our architecture incorporates a spectral transformer and a spectrum-aware spatial prior module to extract rich spatial-spectral features.<n>Our architecture achieves state-of-the-art semantic segmentation performance while directly using HSI inputs, outperforming both vision-based and hyperspectral segmentation methods.
arXiv Detail & Related papers (2025-09-24T13:32:07Z) - GCRPNet: Graph-Enhanced Contextual and Regional Perception Network for Salient Object Detection in Optical Remote Sensing Images [68.33481681452675]
We propose a graph-enhanced contextual and regional perception network (GCRPNet)<n>It builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation.<n>It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information.
arXiv Detail & Related papers (2025-08-14T11:31:43Z) - Spatial Knowledge Graph-Guided Multimodal Synthesis [78.11669780958657]
We introduce a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation.<n>In experiments, data synthesized from diverse types of spatial knowledge, including direction and distance, enhance the spatial perception and reasoning abilities of MLLMs markedly.<n>We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence.
arXiv Detail & Related papers (2025-05-28T17:50:21Z) - On the use of Graphs for Satellite Image Time Series [3.2623791881739033]
This paper is an effort to examine the integration of graph-based methods in remote-sensing analysis.<n>It aims to present a versatile graph-based pipeline to tackle SITS analysis.<n>The paper includes a review and two case studies, which highlight the potential of graph-based approaches for land cover mapping and water forecasting datasets.
arXiv Detail & Related papers (2025-05-22T13:53:36Z) - Dynamic Cross-Modal Feature Interaction Network for Hyperspectral and LiDAR Data Classification [66.59320112015556]
Hyperspectral image (HSI) and LiDAR data joint classification is a challenging task.<n>We propose a novel Dynamic Cross-Modal Feature Interaction Network (DCMNet)<n>Our approach introduces three feature interaction blocks: Bilinear Spatial Attention Block (BSAB), Bilinear Channel Attention Block (BCAB), and Integration Convolutional Block (ICB)
arXiv Detail & Related papers (2025-03-10T05:50:13Z) - Hybrid State-Space and GRU-based Graph Tokenization Mamba for Hyperspectral Image Classification [14.250184447492208]
Hyperspectral image (HSI) classification plays a pivotal role in domains such as environmental monitoring, agriculture, and urban planning.
Traditional methods, including machine learning and convolutional neural networks (CNNs), often struggle to effectively capture these intricate spectral-spatial features.
This work proposes GraphMamba, a hybrid model that combines spectral-spatial token generation, graph-based token prioritization, and cross-attention mechanisms.
arXiv Detail & Related papers (2025-02-10T13:02:19Z) - HSLiNets: Hyperspectral Image and LiDAR Data Fusion Using Efficient Dual Non-Linear Feature Learning Networks [7.06787067270941]
The integration of hyperspectral imaging (HSI) and LiDAR data within new linear feature spaces offers a promising solution to the challenges posed by the high-dimensionality and redundancy inherent in HSIs.<n>This study introduces a dual linear fused space framework that capitalizes on bidirectional reversed convolutional neural network (CNN) pathways, coupled with a specialized spatial analysis block.<n>The proposed method not only enhances data processing and classification accuracy, but also mitigates the computational burden typically associated with advanced models such as Transformers.
arXiv Detail & Related papers (2024-11-30T01:08:08Z) - Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation [60.80423207808076]
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation.<n>We propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations.<n>We build HRVMamba, a novel model for efficient high-resolution representation learning.
arXiv Detail & Related papers (2024-10-04T06:19:29Z) - Spatiotemporal Graph Learning with Direct Volumetric Information Passing and Feature Enhancement [62.91536661584656]
We propose a dual-module framework, Cell-embedded and Feature-enhanced Graph Neural Network (aka, CeFeGNN) for learning.<n>We embed learnable cell attributions to the common node-edge message passing process, which better captures the spatial dependency of regional features.<n>Experiments on various PDE systems and one real-world dataset demonstrate that CeFeGNN achieves superior performance compared with other baselines.
arXiv Detail & Related papers (2024-09-26T16:22:08Z) - GraphMamba: An Efficient Graph Structure Learning Vision Mamba for Hyperspectral Image Classification [19.740333867168108]
GraphMamba is an efficient graph structure learning vision Mamba classification framework to achieve deep spatial-spectral information mining.
The core components of GraphMamba include the HyperMamba module for improving computational efficiency and the SpectralGCN module for adaptive spatial context awareness.
arXiv Detail & Related papers (2024-07-11T07:56:08Z) - HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model [88.13261547704444]
Hyper SIGMA is a vision transformer-based foundation model for HSI interpretation.
It integrates spatial and spectral features using a specially designed spectral enhancement module.
It shows significant advantages in scalability, robustness, cross-modal transferring capability, and real-world applicability.
arXiv Detail & Related papers (2024-06-17T13:22:58Z) - GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image [94.56927147492738]
We introduce GeoWizard, a new generative foundation model designed for estimating geometric attributes from single images.
We show that leveraging diffusion priors can markedly improve generalization, detail preservation, and efficiency in resource usage.
We propose a simple yet effective strategy to segregate the complex data distribution of various scenes into distinct sub-distributions.
arXiv Detail & Related papers (2024-03-18T17:50:41Z) - Hyperspectral Image Reconstruction via Combinatorial Embedding of
Cross-Channel Spatio-Spectral Clues [6.580484964018551]
Existing learning-based hyperspectral reconstruction methods show limitations in fully exploiting the information among the hyperspectral bands.
We propose to investigate the inter-dependencies in their respective hyperspectral space.
These embedded features can be fully exploited by querying the inter-channel correlations.
arXiv Detail & Related papers (2023-12-18T11:37:19Z) - Automated Spatio-Temporal Graph Contrastive Learning [18.245433428868775]
We develop an automated-temporal augmentation scheme with a parameterized contrastive view generator.
AutoST can adapt to the heterogeneous graph with multi-view semantics well preserved.
Experiments for three downstream-temporal mining tasks on several real-world datasets demonstrate the significant performance gain.
arXiv Detail & Related papers (2023-05-06T03:52:33Z) - Spatial-Spectral Clustering with Anchor Graph for Hyperspectral Image [88.60285937702304]
This paper proposes a novel unsupervised approach called spatial-spectral clustering with anchor graph (SSCAG) for HSI data clustering.
The proposed SSCAG is competitive against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-04-24T08:09:27Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - On the spatial attention in Spatio-Temporal Graph Convolutional Networks
for skeleton-based human action recognition [97.14064057840089]
Graphal networks (GCNs) promising performance in skeleton-based human action recognition by modeling a sequence of skeletons as a graph.
Most of the recently proposed G-temporal-based methods improve the performance by learning the graph structure at each layer of the network.
arXiv Detail & Related papers (2020-11-07T19:03:04Z) - Spatial-Spectral Residual Network for Hyperspectral Image
Super-Resolution [82.1739023587565]
We propose a novel spectral-spatial residual network for hyperspectral image super-resolution (SSRNet)
Our method can effectively explore spatial-spectral information by using 3D convolution instead of 2D convolution, which enables the network to better extract potential information.
In each unit, we employ spatial and temporal separable 3D convolution to extract spatial and spectral information, which not only reduces unaffordable memory usage and high computational cost, but also makes the network easier to train.
arXiv Detail & Related papers (2020-01-14T03:34:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.