Related papers: Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications

Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications

URL: http://arxiv.org/abs/2509.19087v1
Date: Tue, 23 Sep 2025 14:40:52 GMT
Title: Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications
Authors: Ganesh Mallya, Yotam Gigi, Dahun Kim, Maxim Neumann, Genady Beryozkin, Tomer Shekel, Anelia Angelova,
Abstract summary: We propose a training-free approach which introduces new multi-spectral data in a Zero-Shot-only mode.<n>We observe strong Zero-Shot performance gains of the approach on popular Remote Sensing benchmarks for land cover and land use classification.
Score: 18.162960869640646
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-spectral imagery plays a crucial role in diverse Remote Sensing applications including land-use classification, environmental monitoring and urban planning. These images are widely adopted because their additional spectral bands correlate strongly with physical materials on the ground, such as ice, water, and vegetation. This allows for more accurate identification, and their public availability from missions, such as Sentinel-2 and Landsat, only adds to their value. Currently, the automatic analysis of such data is predominantly managed through machine learning models specifically trained for multi-spectral input, which are costly to train and support. Furthermore, although providing a lot of utility for Remote Sensing, such additional inputs cannot be used with powerful generalist large multimodal models, which are capable of solving many visual problems, but are not able to understand specialized multi-spectral signals. To address this, we propose a training-free approach which introduces new multi-spectral data in a Zero-Shot-only mode, as inputs to generalist multimodal models, trained on RGB-only inputs. Our approach leverages the multimodal models' understanding of the visual space, and proposes to adapt to inputs to that space, and to inject domain-specific information as instructions into the model. We exemplify this idea with the Gemini2.5 model and observe strong Zero-Shot performance gains of the approach on popular Remote Sensing benchmarks for land cover and land use classification and demonstrate the easy adaptability of Gemini2.5 to new inputs. These results highlight the potential for geospatial professionals, working with non-standard specialized inputs, to easily leverage powerful multimodal models, such as Gemini2.5, to accelerate their work, benefiting from their rich reasoning and contextual capabilities, grounded in the specialized sensor data.

Related papers

TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation [65.74990259650984]
We introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery.<n>Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism.<n>TerraFM achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench.
arXiv Detail & Related papers (2025-06-06T17:59:50Z)
Galileo: Learning Global & Local Features of Many Remote Sensing Modalities [34.71460539414284]
We present a novel self-supervised learning algorithm that extracts multi-scale features across a flexible set of input modalities through masked modeling.<n>Our Galileo is a single generalist model that outperforms SoTA specialist models for satellite images and pixel time series across eleven benchmarks and multiple tasks.
arXiv Detail & Related papers (2025-02-13T14:21:03Z)
SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [79.58755811919366]
This paper introduces a new task called Multi-Modal datasets and Multi-Task Object Detection (M2Det) for remote sensing.<n>It is designed to accurately detect horizontal or oriented objects from any sensor modality.<n>This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z)
TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability.<n>To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT.<n>This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
Bridging Remote Sensors with Multisensor Geospatial Foundation Models [15.289711240431107]
msGFM is a multisensor geospatial foundation model that unifies data from four key sensor modalities. For data originating from identical geolocations, our model employs an innovative cross-sensor pretraining approach. msGFM has demonstrated enhanced proficiency in a range of both single-sensor and multisensor downstream tasks.
arXiv Detail & Related papers (2024-04-01T17:30:56Z)
Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery [78.43828998065071]
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks. Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amount of unlabelled data. In this paper, we re-visit transformers pre-training and leverage multi-scale information that is effectively utilized with multiple modalities.
arXiv Detail & Related papers (2024-03-08T16:18:04Z)
MEM: Multi-Modal Elevation Mapping for Robotics and Learning [10.476978089902818]
We extend a 2.5D robot-centric elevation mapping framework by fusing multi-modal information from multiple sources into a popular map representation. Our system is designed to run on the GPU, making it real-time capable for various robotic and learning tasks.
arXiv Detail & Related papers (2023-09-28T19:55:29Z)
General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation [35.100738362291416]
Multimodal AI seeks to exploit complementary data sources, particularly for complex tasks like semantic segmentation. Recent trends in general-purpose multimodal networks have shown great potential to achieve state-of-the-art performance. We propose a UNet-inspired module that employs 3D convolution to encode vital local information and learn cross-modal features simultaneously.
arXiv Detail & Related papers (2023-07-07T04:58:34Z)
Inertial Hallucinations -- When Wearable Inertial Devices Start Seeing Things [82.15959827765325]
We propose a novel approach to multimodal sensor fusion for Ambient Assisted Living (AAL) We address two major shortcomings of standard multimodal approaches, limited area coverage and reduced reliability. Our new framework fuses the concept of modality hallucination with triplet learning to train a model with different modalities to handle missing sensors at inference time.
arXiv Detail & Related papers (2022-07-14T10:04:18Z)
SEnSeI: A Deep Learning Module for Creating Sensor Independent Cloud Masks [0.7340845393655052]
We introduce a novel neural network architecture -- Spectral ENcoder for SEnsor Independence (SEnSeI) We focus on the problem of cloud masking, using several pre-existing datasets, and a new, freely available dataset for Sentinel-2. Our model is shown to achieve state-of-the-art performance on the satellites it was trained on (Sentinel-2 and Landsat 8), and is able to extrapolate to sensors it has not seen during training such as Landsat 7, Per'uSat-1, and Sentinel-3 SLSTR.
arXiv Detail & Related papers (2021-11-16T10:47:10Z)
Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture. We aim at learning pedestrian representations based on object center and scale rather than direct bounding box predictions. Results show our method's effectiveness in detecting small-scaled pedestrians.
arXiv Detail & Related papers (2020-08-19T13:13:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.