Task-Driven Prompt Learning: A Joint Framework for Multi-modal Cloud Removal and Segmentation
- URL: http://arxiv.org/abs/2601.12052v1
- Date: Sat, 17 Jan 2026 13:32:38 GMT
- Title: Task-Driven Prompt Learning: A Joint Framework for Multi-modal Cloud Removal and Segmentation
- Authors: Zaiyan Zhang, Jie Li, Shaowei Shi, Qiangqiang Yuan
- Abstract summary: TDP-CR is a task-driven framework that jointly performs cloud removal and land-cover segmentation. Central to our approach is a Prompt-Guided Fusion mechanism, which utilizes a learnable degradation prompt to encode cloud thickness and spatial uncertainty. Experiments on the LuojiaSET-OSFCR dataset demonstrate the superiority of our framework.
- Score: 11.468907022707013
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optical remote sensing imagery is indispensable for Earth observation, yet persistent cloud occlusion limits its downstream utility. Most cloud removal (CR) methods are optimized for low-level fidelity and can over-smooth textures and boundaries that are critical for analysis-ready data (ARD), leading to a mismatch between visually plausible restoration and semantic utility. To bridge this gap, we propose TDP-CR, a task-driven multimodal framework that jointly performs cloud removal and land-cover segmentation. Central to our approach is a Prompt-Guided Fusion (PGF) mechanism, which utilizes a learnable degradation prompt to encode cloud thickness and spatial uncertainty. By combining global channel context with local prompt-conditioned spatial bias, PGF adaptively integrates Synthetic Aperture Radar (SAR) information only where optical data is corrupted. We further introduce a parameter-efficient two-phase training strategy that decouples reconstruction and semantic representation learning. Experiments on the LuojiaSET-OSFCR dataset demonstrate the superiority of our framework: TDP-CR surpasses heavy state-of-the-art baselines by 0.18 dB in PSNR while using only 15% of the parameters, and consistently achieves a 1.4% mIoU improvement over multi-task competitors, effectively delivering analysis-ready data.
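The abstract describes PGF only at a high level. A minimal numpy sketch of what a prompt-conditioned fusion step could look like, assuming feature maps of shape (C, H, W), a per-channel degradation prompt, and a cloud mask in [0, 1]; all names, shapes, and the gating formula are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prompt_guided_fusion(opt_feat, sar_feat, prompt, cloud_mask):
    """Hypothetical sketch: blend SAR features into optical features only
    where the cloud mask indicates corruption, with a prompt-conditioned
    spatial bias modulating the blend weight.
    opt_feat, sar_feat: (C, H, W); prompt: (C,); cloud_mask: (H, W) in [0, 1].
    """
    # Global channel context: per-channel mean of the optical features.
    channel_ctx = opt_feat.mean(axis=(1, 2), keepdims=True)          # (C, 1, 1)
    # Local spatial bias: degradation prompt projected onto the cloud mask.
    spatial_bias = prompt[:, None, None] * cloud_mask[None, :, :]    # (C, H, W)
    # Gate is zero over clear pixels, so SAR enters only where optical is corrupted.
    gate = sigmoid(channel_ctx + spatial_bias) * cloud_mask[None, :, :]
    return (1.0 - gate) * opt_feat + gate * sar_feat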
Related papers
- ConsistentRFT: Reducing Visual Hallucinations in Flow-based Reinforcement Fine-Tuning [85.20505958752928]
Reinforcement Fine-Tuning (RFT) on flow-based models is crucial for preference alignment. However, RFT often introduces visual hallucinations such as over-optimized details and semantic misalignment. This work preliminarily explores why visual hallucinations arise and how to reduce them.
arXiv Detail & Related papers (2026-02-03T11:49:46Z) - Unleashing Degradation-Carrying Features in Symmetric U-Net: Simpler and Stronger Baselines for All-in-One Image Restoration [52.82397287366076]
All-in-one image restoration aims to handle diverse degradations (e.g., noise, blur, adverse weather) within a unified framework. In this work, we reveal a critical insight: well-crafted feature extraction inherently encodes degradation-carrying information. Our symmetric design preserves intrinsic degradation signals robustly, rendering simple additive fusion in skip connections sufficient.
arXiv Detail & Related papers (2025-12-11T12:20:31Z) - TriFusion-AE: Language-Guided Depth and LiDAR Fusion for Robust Point Cloud Processing [0.0]
Autoencoders offer a natural framework for denoising and reconstruction, but their performance degrades under challenging real-world conditions. We propose TriFusion-AE, a cross-attention autoencoder that integrates textual priors, monocular depth maps from multi-view images, and LiDAR point clouds to improve robustness. Our model achieves significantly more robust reconstruction under strong adversarial attacks and heavy noise, where CNN-based autoencoders collapse.
arXiv Detail & Related papers (2025-09-23T07:37:28Z) - Evaluating the Efficiency of Latent Spaces via the Coupling-Matrix [0.5013248430919224]
We introduce a redundancy index, denoted rho(C), that directly quantifies inter-dimensional dependencies. Low rho(C) reliably predicts high classification accuracy or low reconstruction error, while elevated redundancy is associated with performance collapse. We show that Tree-structured Parzen Estimators (TPE) preferentially explore low-rho regions, suggesting that rho(C) can guide neural architecture search and serve as a redundancy-aware regularization target.
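The abstract does not give the formula for rho(C). A plausible sketch of such a redundancy index, assuming the coupling matrix C is the inter-dimensional correlation matrix of latent codes and rho is the mean absolute off-diagonal entry (the paper's exact definition may differ):

```python
import numpy as np

def redundancy_index(latents):
    """Hypothetical redundancy score rho(C): mean absolute off-diagonal
    entry of the inter-dimensional correlation (coupling) matrix.
    latents: (n_samples, n_dims). Returns a value in [0, 1]:
    0 = fully decorrelated dimensions, 1 = fully coupled."""
    C = np.corrcoef(latents, rowvar=False)   # (d, d) coupling matrix
    d = C.shape[0]
    off_diag = np.abs(C - np.eye(d))         # zero out the diagonal
    return off_diag.sum() / (d * (d - 1))    # average over off-diagonal entries
```

Under this definition, a latent space whose dimensions duplicate each other scores near 1, while independent dimensions score near 0, matching the abstract's link between low rho and good downstream performance.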
arXiv Detail & Related papers (2025-09-08T03:36:47Z) - Graph-Based Uncertainty Modeling and Multimodal Fusion for Salient Object Detection [12.743278093269325]
We propose a dynamic uncertainty propagation and multimodal collaborative reasoning network (DUP-MCRNet). DUGC is designed to propagate uncertainty between layers through a sparse graph constructed based on spatial semantic distance. MCF uses learnable modality gating weights to weightedly fuse the attention maps of RGB, depth, and edge features.
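The gated fusion described for MCF can be sketched in a few lines; the softmax gating and shapes below are assumptions for illustration, not the paper's architecture:

```python
import numpy as np

def gated_modal_fusion(attn_maps, gate_logits):
    """Sketch of modality gating: a softmax over learnable per-modality
    logits yields weights summing to 1, and the RGB/depth/edge attention
    maps are combined as a weighted sum.
    attn_maps: (M, H, W) stacked per-modality maps; gate_logits: (M,)."""
    w = np.exp(gate_logits - gate_logits.max())   # numerically stable softmax
    w = w / w.sum()
    return np.tensordot(w, attn_maps, axes=1)     # weighted sum -> (H, W)
```

In a trained network the logits would be learned parameters; here they are plain inputs so the behavior is easy to inspect (equal logits average the maps, a dominant logit selects its modality).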
arXiv Detail & Related papers (2025-08-28T04:31:48Z) - Minimal High-Resolution Patches Are Sufficient for Whole Slide Image Representation via Cascaded Dual-Scale Reconstruction [13.897013242536849]
Whole-slide image (WSI) analysis remains challenging due to gigapixel scale and sparsely distributed diagnostic regions.<n>We propose a Cascaded Dual-Scale Reconstruction framework, demonstrating that only an average of 9 high-resolution patches per WSI are sufficient for robust slide-level representation.
arXiv Detail & Related papers (2025-08-03T08:01:30Z) - FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [92.4205087439928]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability. We propose the Self-supervised Transfer (PST) module and the Frequency-Decoupled Fusion module (FreDF). PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity. FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches. This combined approach enables FUSE to construct a universal image-event representation that only requires lightweight decoder adaptation for target datasets.
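The frequency decoupling FreDF performs can be illustrated with a simple FFT-based split: keep low-frequency structure from the image branch and high-frequency edges from the event branch. The cutoff value and single-channel shapes are assumed for the sketch; the paper's module operates on learned features, not raw maps:

```python
import numpy as np

def frequency_decoupled_fuse(img_feat, evt_feat, cutoff=0.25):
    """Sketch of frequency-decoupled fusion: low-pass the image branch,
    high-pass the event branch, and sum. cutoff is a fraction of the
    Nyquist band (an assumed hyper-parameter). Inputs: (H, W) arrays."""
    H, W = img_feat.shape
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    # Radial low-pass mask in normalized frequency coordinates.
    low_pass = (np.sqrt(fy**2 + fx**2) <= cutoff * 0.5).astype(float)
    img_low = np.fft.ifft2(np.fft.fft2(img_feat) * low_pass).real
    evt_high = np.fft.ifft2(np.fft.fft2(evt_feat) * (1.0 - low_pass)).real
    return img_low + evt_high
```

This makes the complementarity concrete: smooth scene structure survives from the image path, while only edge-like high-frequency content is taken from the event path.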
arXiv Detail & Related papers (2025-03-25T15:04:53Z) - SING: Semantic Image Communications using Null-Space and INN-Guided Diffusion Models [52.40011613324083]
Deep joint source-channel coding (DeepJSCC) systems have recently demonstrated remarkable performance in wireless image transmission. Existing methods focus on minimizing distortion between the transmitted image and the reconstructed version at the receiver, often overlooking perceptual quality. We propose SING, a novel framework that formulates the recovery of high-quality images from corrupted reconstructions as an inverse problem.
arXiv Detail & Related papers (2025-03-16T12:32:11Z) - Multi-scale Restoration of Missing Data in Optical Time-series Images with Masked Spatial-Temporal Attention Network [0.6675733925327885]
Existing methods for imputing missing values in remote sensing images fail to fully exploit auxiliary information.
This paper proposes a novel deep-learning-based approach called MS2 for reconstructing time-series remote sensing images.
arXiv Detail & Related papers (2024-06-19T09:05:05Z) - Magic ELF: Image Deraining Meets Association Learning and Transformer [63.761812092934576]
This paper aims to unify CNN and Transformer to take advantage of their learning merits for image deraining.
A novel multi-input attention module (MAM) is proposed to associate rain removal and background recovery.
Our proposed method (dubbed as ELF) outperforms the state-of-the-art approach (MPRNet) by 0.25 dB on average.
arXiv Detail & Related papers (2022-07-21T12:50:54Z) - Unpaired Adversarial Learning for Single Image Deraining with Rain-Space Contrastive Constraints [61.40893559933964]
We develop an effective unpaired SID method which explores mutual properties of the unpaired exemplars by a contrastive learning manner in a GAN framework, named as CDR-GAN.
Our method performs favorably against existing unpaired deraining approaches on both synthetic and real-world datasets, and even outperforms several fully-supervised or semi-supervised models.
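The "contrastive learning manner" mentioned for CDR-GAN can be sketched with an InfoNCE-style loss: pull a derained anchor toward a clean positive and push it away from rainy negatives. The feature shapes, cosine similarity, and temperature are illustrative assumptions, not CDR-GAN's exact objective:

```python
import numpy as np

def contrastive_loss(anchor, positive, negatives, tau=0.1):
    """Hypothetical InfoNCE-style loss in 'rain space': the anchor
    (derained features) is attracted to the clean positive and repelled
    from rainy negatives. Inputs are 1-D feature vectors; tau is the
    softmax temperature."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.exp(cos(anchor, positive) / tau)
    neg = sum(np.exp(cos(anchor, n) / tau) for n in negatives)
    return -np.log(pos / (pos + neg))
```

The loss is near zero when the anchor matches the clean exemplar and the negatives are dissimilar, and grows as rainy negatives become indistinguishable from the positive, which is what drives the unpaired training signal.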
arXiv Detail & Related papers (2021-09-07T10:00:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.