OmniSat: Self-Supervised Modality Fusion for Earth Observation
- URL: http://arxiv.org/abs/2404.08351v3
- Date: Wed, 17 Jul 2024 08:16:14 GMT
- Title: OmniSat: Self-Supervised Modality Fusion for Earth Observation
- Authors: Guillaume Astruc, Nicolas Gonthier, Clement Mallet, Loic Landrieu
- Abstract summary: We introduce OmniSat, a novel architecture able to merge diverse EO modalities into expressive features without labels.
As demonstrated for three downstream tasks, OmniSat can learn rich representations without supervision, leading to state-of-the-art performances.
Our multimodal pretraining scheme improves performance even when only one modality is available for inference.
- Score: 5.767156832161819
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The diversity and complementarity of sensors available for Earth Observation (EO) call for developing bespoke self-supervised multimodal learning approaches. However, current multimodal EO datasets and models typically focus on a single data type, either mono-date images or time series, which limits their impact. To address this issue, we introduce OmniSat, a novel architecture able to merge diverse EO modalities into expressive features without labels by exploiting their alignment. To demonstrate the advantages of our approach, we create two new multimodal datasets by augmenting existing ones with new modalities. As demonstrated for three downstream tasks -- forestry, land cover classification, and crop mapping -- OmniSat can learn rich representations without supervision, leading to state-of-the-art performances in semi- and fully supervised settings. Furthermore, our multimodal pretraining scheme improves performance even when only one modality is available for inference. The code and dataset are available at https://github.com/gastruc/OmniSat.
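To make the alignment idea concrete, the sketch below shows one common way to exploit co-registered modalities for label-free pretraining: samples of two modalities covering the same location are treated as positive pairs under a symmetric InfoNCE contrastive loss. The encoder shapes, dimensions, and loss are illustrative assumptions only and do not reproduce OmniSat's actual architecture or training objective; see the linked repository for the real implementation.

```python
# Hypothetical sketch: contrastive alignment of two EO modalities (e.g. an aerial
# image and a satellite time series of the same parcel). Not OmniSat's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Maps one modality to a shared embedding space (placeholder MLP)."""
    def __init__(self, in_dim: int, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

def contrastive_alignment_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: co-located samples across modalities are positives."""
    logits = z_a @ z_b.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with flattened features from two aligned modalities (assumed sizes).
enc_img, enc_ts = ModalityEncoder(in_dim=3072), ModalityEncoder(in_dim=640)
x_img, x_ts = torch.randn(32, 3072), torch.randn(32, 640)
loss = contrastive_alignment_loss(enc_img(x_img), enc_ts(x_ts))
loss.backward()  # supervision comes from spatial alignment, not labels
```

The key point is that the training signal comes entirely from the spatial alignment of the modalities, so no annotations are needed during pretraining.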
Related papers
- TerraMind: Large-Scale Generative Multimodality for Earth Observation [3.5472166810202457]
We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation.
Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data.
arXiv Detail & Related papers (2025-04-15T13:17:39Z) - FusDreamer: Label-efficient Remote Sensing World Model for Multimodal Data Classification [7.523866920738647]
This paper proposes FusDreamer, a label-efficient remote sensing world model for multimodal data fusion. FusDreamer uses the world model as a unified representation container to abstract common and high-level knowledge.
Experiments conducted on four typical datasets indicate the effectiveness and advantages of the proposed FusDreamer.
arXiv Detail & Related papers (2025-03-18T01:45:51Z) - Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z) - TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability.
To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT.
This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z) - OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously.
Our main findings reveal that most OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts.
To address this gap, we curate an instruction tuning dataset of 84.5K training samples, OmniInstruct, for training OLMs to adapt to multimodal contexts.
arXiv Detail & Related papers (2024-09-23T17:59:05Z) - OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces [67.07083389543799]
We present OmniBind, large-scale multimodal joint representation models ranging in scale from 7 billion to 30 billion parameters.
Due to the scarcity of data pairs across all modalities, instead of training large models from scratch, we propose remapping and binding the spaces of various pre-trained specialist models together.
Experiments demonstrate the versatility and superiority of OmniBind as an omni representation model, highlighting its great potential for diverse applications.
arXiv Detail & Related papers (2024-07-16T16:24:31Z) - Multi-Modal Video Dialog State Tracking in the Wild [10.453212911612866]
MST-MIXER is a novel video dialog model operating over a generic multi-modal state tracking scheme.
It predicts the missing underlying structure of the selected constituents of each input modality using a novel multi-modal graph structure learning method.
It achieves new state-of-the-art results on five challenging benchmarks.
arXiv Detail & Related papers (2024-07-02T12:34:17Z) - Learning Shared RGB-D Fields: Unified Self-supervised Pre-training for Label-efficient LiDAR-Camera 3D Perception [17.11366229887873]
We introduce a unified pretraining strategy, NeRF-Supervised Masked AutoEncoder (NS-MAE).
NS-MAE exploits NeRF's ability to encode both appearance and geometry, enabling efficient masked reconstruction of multi-modal data.
Results: NS-MAE outperforms prior SOTA pre-training methods that employ separate strategies for each modality.
arXiv Detail & Related papers (2024-05-28T08:13:49Z) - MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning [9.540487697801531]
MMEarth is a diverse multi-modal pretraining dataset at global scale.
We propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images (a generic masked-reconstruction sketch is given after this list).
arXiv Detail & Related papers (2024-05-04T23:16:48Z) - MergeOcc: Bridge the Domain Gap between Different LiDARs for Robust Occupancy Prediction [8.993992124170624]
MergeOcc is developed to simultaneously handle different LiDARs by leveraging multiple datasets.
The effectiveness of MergeOcc is validated through experiments on two prominent datasets for autonomous vehicles.
arXiv Detail & Related papers (2024-03-13T13:23:05Z) - Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery [78.43828998065071]
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks.
Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amounts of unlabelled data.
In this paper, we re-visit transformers pre-training and leverage multi-scale information that is effectively utilized with multiple modalities.
arXiv Detail & Related papers (2024-03-08T16:18:04Z) - ViT-Lens: Towards Omni-modal Representations [64.66508684336614]
ViT-Lens-2 is a framework for extending representation learning to a growing set of modalities.
We show that ViT-Lens-2 can learn representations for 3D point cloud, depth, audio, tactile and EEG.
By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation.
arXiv Detail & Related papers (2023-11-27T18:52:09Z) - Preserving Modality Structure Improves Multi-Modal Learning [64.10085674834252]
Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings without relying on human annotations.
These methods often struggle to generalize well on out-of-domain data as they ignore the semantic structure present in modality-specific embeddings.
We propose a novel Semantic-Structure-Preserving Consistency approach to improve generalizability by preserving the modality-specific relationships in the joint embedding space.
arXiv Detail & Related papers (2023-08-24T20:46:48Z) - Navya3DSeg -- Navya 3D Semantic Segmentation Dataset & split generation for autonomous vehicles [63.20765930558542]
3D semantic data are useful for core perception tasks such as obstacle detection and ego-vehicle localization.
We propose a new dataset, Navya 3D Semantic Segmentation (Navya3DSeg), with a diverse label space corresponding to a large-scale, production-grade operational domain.
It contains 23 labeled sequences and 25 supplementary sequences without labels, designed to explore self-supervised and semi-supervised semantic segmentation benchmarks on point clouds.
arXiv Detail & Related papers (2023-02-16T13:41:19Z) - Generalized Zero-Shot Learning using Multimodal Variational Auto-Encoder with Semantic Concepts [0.9054540533394924]
Recent techniques try to learn a cross-modal mapping between the semantic space and the image space.
We propose a Multimodal Variational Auto-Encoder (M-VAE) which can learn the shared latent space of image features and the semantic space.
Our results show that our proposed model outperforms the current state-of-the-art approaches for generalized zero-shot learning.
arXiv Detail & Related papers (2021-06-26T20:08:37Z)
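Several of the papers above (e.g. MP-MAE in MMEarth, NS-MAE) build on masked reconstruction as a self-supervised pretext task. The sketch below is a deliberately simplified, SimMIM/BEiT-style version with assumed patch shapes and a learned mask token: a random subset of patches is replaced by the mask token and the network is trained to reconstruct only those patches. It illustrates the general idea only; the actual methods use ViT backbones, may encode only the visible patches, and add modality-specific or NeRF-based reconstruction targets.

```python
# Simplified masked-reconstruction pretext (illustrative only; not MP-MAE/NS-MAE code).
import torch
import torch.nn as nn

class MaskedReconstruction(nn.Module):
    """Masked-reconstruction pretext over pre-extracted patch features."""
    def __init__(self, patch_dim: int = 768, embed_dim: int = 256):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(patch_dim))  # learned [MASK] patch
        self.encoder = nn.Sequential(nn.Linear(patch_dim, embed_dim), nn.ReLU(),
                                     nn.Linear(embed_dim, embed_dim))
        self.decoder = nn.Linear(embed_dim, patch_dim)

    def forward(self, patches: torch.Tensor, mask_ratio: float = 0.75) -> torch.Tensor:
        # patches: (batch, num_patches, patch_dim), e.g. flattened image tiles
        mask = torch.rand(patches.shape[:2], device=patches.device) < mask_ratio
        corrupted = torch.where(mask.unsqueeze(-1), self.mask_token, patches)
        recon = self.decoder(self.encoder(corrupted))
        # Reconstruction loss is computed on the masked patches only.
        return ((recon - patches) ** 2)[mask].mean()

patches = torch.randn(8, 196, 768)   # toy batch: 8 samples, 14x14 patches each
loss = MaskedReconstruction()(patches)
loss.backward()
```

In practice, the pretrained encoder is then reused for downstream tasks such as land cover classification or crop mapping with few or no labels.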