CroBIM-U: Uncertainty-Driven Referring Remote Sensing Image Segmentation
- URL: http://arxiv.org/abs/2601.03490v1
- Date: Wed, 07 Jan 2026 01:02:39 GMT
- Title: CroBIM-U: Uncertainty-Driven Referring Remote Sensing Image Segmentation
- Authors: Yuzhe Sun, Zhe Dong, Haochen Jiang, Tianzhu Liu, Yanfeng Gu
- Abstract summary: Referring remote sensing image segmentation aims to localize specific targets described by natural language within complex overhead imagery. Existing methods typically employ uniform fusion and refinement strategies across the entire image. We propose an uncertainty-guided framework that explicitly leverages a pixel-wise referring uncertainty map as a spatial prior to orchestrate adaptive inference.
- Score: 8.834663340762562
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring remote sensing image segmentation aims to localize specific targets described by natural language within complex overhead imagery. However, due to extreme scale variations, dense similar distractors, and intricate boundary structures, the reliability of cross-modal alignment exhibits significant spatial non-uniformity. Existing methods typically employ uniform fusion and refinement strategies across the entire image, which often introduces unnecessary linguistic perturbations in visually clear regions while failing to provide sufficient disambiguation in confused areas. To address this, we propose an uncertainty-guided framework that explicitly leverages a pixel-wise referring uncertainty map as a spatial prior to orchestrate adaptive inference. Specifically, we introduce a plug-and-play Referring Uncertainty Scorer (RUS), which is trained via an online error-consistency supervision strategy to interpretably predict the spatial distribution of referential ambiguity. Building on this prior, we design two plug-and-play modules: 1) Uncertainty-Gated Fusion (UGF), which dynamically modulates language injection strength to enhance constraints in high-uncertainty regions while suppressing noise in low-uncertainty ones; and 2) Uncertainty-Driven Local Refinement (UDLR), which utilizes uncertainty-derived soft masks to focus refinement on error-prone boundaries and fine details. Extensive experiments demonstrate that our method functions as a unified, plug-and-play solution that significantly improves robustness and geometric fidelity in complex remote sensing scenes without altering the backbone architecture.
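The abstract describes UGF and UDLR only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes single-channel 2D maps stored as nested lists, an uncertainty map with values in [0, 1], a simple additive gate for fusion, and a linear blend for refinement. The function names `gated_fusion` and `soft_mask_refine` are hypothetical and do not come from the paper.

```python
# Illustrative sketch only: the gating and blending rules below are assumptions,
# not the formulation used in CroBIM-U.

def gated_fusion(visual, language, uncertainty):
    """Inject language features per pixel, scaled by uncertainty u in [0, 1].

    Where u is near 0 the visual feature passes through almost unchanged;
    where u is near 1 the language feature is added at full strength.
    """
    return [
        [v + u * l for v, l, u in zip(v_row, l_row, u_row)]
        for v_row, l_row, u_row in zip(visual, language, uncertainty)
    ]


def soft_mask_refine(coarse, refined, uncertainty):
    """Blend a coarse prediction with a refined one, using u as a soft mask.

    Low-uncertainty pixels keep the coarse value; high-uncertainty pixels
    take the refined value.
    """
    return [
        [(1.0 - u) * c + u * r for c, r, u in zip(c_row, r_row, u_row)]
        for c_row, r_row, u_row in zip(coarse, refined, uncertainty)
    ]


if __name__ == "__main__":
    u_map = [[0.0, 1.0]]  # left pixel is confident, right pixel is ambiguous
    fused = gated_fusion([[1.0, 1.0]], [[0.5, 0.5]], u_map)
    blended = soft_mask_refine([[0.2, 0.2]], [[0.8, 0.8]], u_map)
    print(fused, blended)
```

In this toy setup the confident pixel is left untouched by both steps, while the ambiguous pixel receives the full language signal and the fully refined value. That mirrors the spatially non-uniform treatment the abstract argues for.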
Related papers
- FOCA: Frequency-Oriented Cross-Domain Forgery Detection, Localization and Explanation via Multi-Modal Large Language Model [11.08248067961235]
FOCA is a large language model-based framework that integrates discriminative features from both the RGB spatial and frequency domains. FSE-Set is a large-scale dataset with diverse authentic and tampered images, pixel-level masks, and dual-domain annotations.
arXiv Detail & Related papers (2026-02-21T15:53:44Z)
- Unpaired Image-to-Image Translation via a Self-Supervised Semantic Bridge [59.247871132422006]
Adversarial diffusion and diffusion-inversion methods have advanced unpaired image-to-image translation, but each faces key limitations. We propose the Self-Supervised Semantic Bridge (SSB), a versatile framework that integrates external semantic priors into diffusion bridge models. Our key idea is to leverage self-supervised visual encoders to learn representations that are invariant to appearance changes but capture geometric structure.
arXiv Detail & Related papers (2026-02-18T18:05:00Z)
- Bidirectional Cross-Perception for Open-Vocabulary Semantic Segmentation in Remote Sensing Imagery [1.0742675209112622]
Training-free open-vocabulary semantic segmentation (OVSS) methods typically fuse CLIP and vision foundation models (VFMs). We propose a spatial-regularization-aware dual-branch collaborative inference framework for training-free OVSS, termed SDCI. Experiments on multiple remote sensing semantic segmentation benchmarks demonstrate that our method achieves better performance than existing approaches.
arXiv Detail & Related papers (2026-01-29T01:46:03Z)
- A Dual-Branch Local-Global Framework for Cross-Resolution Land Cover Mapping [16.429154404656412]
Cross-resolution land cover mapping aims to produce high-resolution semantic predictions from coarse or low-resolution supervision. Existing weakly supervised approaches often struggle to align fine-grained spatial structures with coarse labels. We propose DDTM, a dual-branch weakly supervised framework that explicitly decouples local semantic refinement from global contextual reasoning.
arXiv Detail & Related papers (2025-12-23T02:32:02Z)
- UAGLNet: Uncertainty-Aggregated Global-Local Fusion Network with Cooperative CNN-Transformer for Building Extraction [83.48950950780554]
Building extraction from remote sensing images is a challenging task due to the complex structure variations of buildings. Existing methods employ convolutional or self-attention blocks to capture the multi-scale features in the segmentation models. We present an Uncertainty-Aggregated Global-Local Fusion Network (UAGLNet) to exploit high-quality global-local visual semantics.
arXiv Detail & Related papers (2025-12-15T02:59:16Z)
- Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition [41.77490816513839]
We propose a novel method for zero-shot skeleton action recognition, termed Flora. Specifically, we attune textual semantics by incorporating direction-aware regional semantics, and a cross-modal consistency objective. Experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10% of the seen data.
arXiv Detail & Related papers (2025-11-12T14:54:53Z)
- Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling [17.78769812974246]
Fine-grained image-text alignment is a pivotal challenge in multimodal learning. We propose a unified approach that incorporates significance-aware and region-level uncertainty modeling. Our approach achieves state-of-the-art performance across various backbone architectures.
arXiv Detail & Related papers (2025-11-11T00:28:11Z)
- Terrain-Enhanced Resolution-aware Refinement Attention for Off-Road Segmentation [0.7734726150561086]
Designs that fuse only at low resolution blur edges and propagate local errors. We introduce a resolution-aware token decoder that balances global semantics, local consistency, and boundary fidelity under imperfect supervision.
arXiv Detail & Related papers (2025-11-03T10:36:57Z)
- Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts [80.32933059529135]
Test-Time Adaptation (TTA) methods have emerged to adapt to target distributions during inference. We propose Dual Uncertainty Optimization (DUO), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. In parallel, we design a semantic-aware normal field constraint that preserves geometric coherence in regions with clear semantic cues.
arXiv Detail & Related papers (2025-08-28T07:09:21Z)
- Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z)
- InpDiffusion: Image Inpainting Localization via Conditional Diffusion Models [10.213390634031049]
Current IIL methods face two main challenges: a tendency towards overconfidence and difficulty in detecting subtle tampering boundaries. We propose a new paradigm that treats IIL as a conditional mask generation task utilizing diffusion models. Our method, InpDiffusion, utilizes the denoising process enhanced by the integration of image semantic conditions to progressively refine predictions.
arXiv Detail & Related papers (2025-01-06T07:32:12Z)
- Progressive Feature Self-reinforcement for Weakly Supervised Semantic Segmentation [55.69128107473125]
We propose a single-stage approach for Weakly Supervised Semantic Segmentation (WSSS) with image-level labels. We adaptively partition the image content into deterministic regions (e.g., confident foreground and background) and uncertain regions (e.g., object boundaries and misclassified categories) for separate processing. Building upon this, we introduce a complementary self-enhancement method that constrains the semantic consistency between these confident regions and an augmented image with the same class labels.
arXiv Detail & Related papers (2023-12-14T13:21:52Z)
- Inter-class Discrepancy Alignment for Face Recognition [55.578063356210144]
We propose a unified framework called Inter-class Discrepancy Alignment (IDA). IDA-DAO is used to align the similarity scores considering the discrepancy between the images and their neighbors. IDA-SSE can provide convincing inter-class neighbors by introducing virtual candidate images generated with GAN.
arXiv Detail & Related papers (2021-03-02T08:20:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.