Filling the Gaps: A Multitask Hybrid Multiscale Generative Framework for Missing Modality in Remote Sensing Semantic Segmentation
- URL: http://arxiv.org/abs/2509.11102v1
- Date: Sun, 14 Sep 2025 05:40:35 GMT
- Title: Filling the Gaps: A Multitask Hybrid Multiscale Generative Framework for Missing Modality in Remote Sensing Semantic Segmentation
- Authors: Nhi Kieu, Kien Nguyen, Arnold Wiliem, Clinton Fookes, Sridha Sridharan,
- Abstract summary: Multimodal learning has shown significant performance boost compared to ordinary unimodal models.<n>In real-world scenarios, multimodal signals are susceptible to missing because of sensor failures and adverse weather conditions.<n>We propose a novel Generative-Enhanced MultiModal learning Network (GEMMNet) to tackle these limitations.
- Score: 28.992992584085787
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal learning has shown significant performance boost compared to ordinary unimodal models across various domains. However, in real-world scenarios, multimodal signals are susceptible to missing because of sensor failures and adverse weather conditions, which drastically deteriorates models' operation and performance. Generative models such as AutoEncoder (AE) and Generative Adversarial Network (GAN) are intuitive solutions aiming to reconstruct missing modality from available ones. Yet, their efficacy in remote sensing semantic segmentation remains underexplored. In this paper, we first examine the limitations of existing generative approaches in handling the heterogeneity of multimodal remote sensing data. They inadequately capture semantic context in complex scenes with large intra-class and small inter-class variation. In addition, traditional generative models are susceptible to heavy dependence on the dominant modality, introducing bias that affects model robustness under missing modality conditions. To tackle these limitations, we propose a novel Generative-Enhanced MultiModal learning Network (GEMMNet) with three key components: (1) Hybrid Feature Extractor (HyFEx) to effectively learn modality-specific representations, (2) Hybrid Fusion with Multiscale Awareness (HyFMA) to capture modality-synergistic semantic context across scales and (3) Complementary Loss (CoLoss) scheme to alleviate the inherent bias by encouraging consistency across modalities and tasks. Our method, GEMMNet, outperforms both generative baselines AE, cGAN (conditional GAN), and state-of-the-art non-generative approaches - mmformer and shaspec - on two challenging semantic segmentation remote sensing datasets (Vaihingen and Potsdam). Source code is made available.
Related papers
- SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data [31.146366498415784]
Multimodal semantic segmentation integrates complementary information from diverse sensors for remote sensing Earth observation.<n>IMSS faces three key challenges: multimodal imbalance, where dominant modalities suppress fragile ones; intra-class variation in scale, shape, and orientation across modalities; and cross-modal heterogeneity with conflicting cues producing inconsistent semantic responses.<n>We propose the Semantic-Guided Modality-Aware (SGMA) framework, which ensures balanced multimodal learning while reducing intra-class variation and reconciling cross-modal inconsistencies through semantic guidance.
arXiv Detail & Related papers (2026-03-03T01:28:21Z) - Understanding and Harnessing Sparsity in Unified Multimodal Models [32.09095929575726]
Large multimodal models have achieved remarkable progress in both understanding and generation.<n>Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework.<n>Yet, a systematic understanding of how these inefficiencies manifest across different components remains limited.
arXiv Detail & Related papers (2025-12-02T02:47:29Z) - Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification [33.302856478333524]
Multimodal classification in remote sensing often suffers from missing modalities caused by environmental interference, sensor failures, or atmospheric effects.<n>Existing two-stage adaptation methods are computationally expensive and assume complete multimodal data during training, limiting their generalization to real-world incompleteness.<n>We propose a Missing-aware Mixture-of-Loras framework that reformulates modality missing as a multi-task learning problem.
arXiv Detail & Related papers (2025-11-14T16:31:37Z) - UniMRSeg: Unified Modality-Relax Segmentation via Hierarchical Self-Supervised Compensation [104.59740403500132]
Multi-modal image segmentation faces real-world deployment challenges from incomplete/corrupted modalities degrading performance.<n>We propose a unified modality-relax segmentation network (UniMRSeg) through hierarchical self-supervised compensation (HSSC)<n>Our approach hierarchically bridges representation gaps between complete and incomplete modalities across input, feature and output levels.
arXiv Detail & Related papers (2025-09-19T17:29:25Z) - Let Synthetic Data Shine: Domain Reassembly and Soft-Fusion for Single Domain Generalization [68.41367635546183]
Single Domain Generalization aims to train models with consistent performance across diverse scenarios using data from a single source.<n>We propose Discriminative Domain Reassembly and Soft-Fusion (DRSF), a training framework leveraging synthetic data to improve model generalization.
arXiv Detail & Related papers (2025-03-17T18:08:03Z) - AMM-Diff: Adaptive Multi-Modality Diffusion Network for Missing Modality Imputation [2.8498944632323755]
In clinical practice, full imaging is not always feasible, often due to complex acquisition protocols, stringent privacy regulations, or specific clinical needs.<n>A promising solution is missing data imputation, where absent modalities are generated from available ones.<n>We propose an Adaptive Multi-Modality Diffusion Network (AMM-Diff), a novel diffusion-based generative model capable of handling any number of input modalities and generating the missing ones.
arXiv Detail & Related papers (2025-01-22T12:29:33Z) - Robust Modality-incomplete Anomaly Detection: A Modality-instructive Framework with Benchmark [69.02666229531322]
We introduce a pioneering study that investigates Modality-Incomplete Industrial Anomaly Detection (MIIAD)<n>We find that most existing MIAD methods perform poorly on the MIIAD Bench, leading to significant performance degradation.<n>We propose a novel two-stage Robust modAlity-aware fusing and Detecting framewoRk, abbreviated as RADAR.
arXiv Detail & Related papers (2024-10-02T16:47:55Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Exploiting modality-invariant feature for robust multimodal emotion
recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN)
We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z) - Efficient Multimodal Transformer with Dual-Level Feature Restoration for
Robust Multimodal Sentiment Analysis [47.29528724322795]
Multimodal Sentiment Analysis (MSA) has attracted increasing attention recently.
Despite significant progress, there are still two major challenges on the way towards robust MSA.
We propose a generic and unified framework to address them, named Efficient Multimodal Transformer with Dual-Level Feature Restoration (EMT-DLFR)
arXiv Detail & Related papers (2022-08-16T08:02:30Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal
Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
Model takes two bimodal pairs as input due to known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z) - A Multi-Semantic Metapath Model for Large Scale Heterogeneous Network
Representation Learning [52.83948119677194]
We propose a multi-semantic metapath (MSM) model for large scale heterogeneous representation learning.
Specifically, we generate multi-semantic metapath-based random walks to construct the heterogeneous neighborhood to handle the unbalanced distributions.
We conduct systematical evaluations for the proposed framework on two challenging datasets: Amazon and Alibaba.
arXiv Detail & Related papers (2020-07-19T22:50:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.