Uni-RS: A Spatially Faithful Unified Understanding and Generation Model for Remote Sensing
- URL: http://arxiv.org/abs/2601.17673v1
- Date: Sun, 25 Jan 2026 03:22:26 GMT
- Title: Uni-RS: A Spatially Faithful Unified Understanding and Generation Model for Remote Sensing
- Authors: Weiyu Zhang, Yuan Hu, Yong Li, Yu Liu,
- Abstract summary: Unified remote sensing multimodal models exhibit a pronounced spatial reversal curse.<n>We propose Uni-RS, the first unified model tailored for remote sensing.<n>We show that our approach substantially improves spatial faithfulness in text-to-image generation.
- Score: 9.357861053928898
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unified remote sensing multimodal models exhibit a pronounced spatial reversal curse: Although they can accurately recognize and describe object locations in images, they often fail to faithfully execute the same spatial relations during text-to-image generation, where such relations constitute core semantic information in remote sensing. Motivated by this observation, we propose Uni-RS, the first unified multimodal model tailored for remote sensing, to explicitly address the spatial asymmetry between understanding and generation. Specifically, we first introduce explicit Spatial-Layout Planning to transform textual instructions into spatial layout plans, decoupling geometric planning from visual synthesis. We then impose Spatial-Aware Query Supervision to bias learnable queries toward spatial relations explicitly specified in the instruction. Finally, we develop Image-Caption Spatial Layout Variation to expose the model to systematic geometry-consistent spatial transformations. Extensive experiments across multiple benchmarks show that our approach substantially improves spatial faithfulness in text-to-image generation, while maintaining strong performance on multimodal understanding tasks like image captioning, visual grounding, and VQA tasks.
Related papers
- RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning [61.84363374647606]
Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions.<n>These descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning.<n>We propose a reasoning-guided, position-aware post-training framework, dubbed textbfRSGround-R1, to progressively enhance spatial understanding.
arXiv Detail & Related papers (2026-01-29T12:35:57Z) - Observing Health Outcomes Using Remote Sensing Imagery and Geo-Context Guided Visual Transformer [8.825339734603862]
We propose a novel model that enhances remote sensing imagery processing with guidance from auxiliary geospatial information.<n>Our approach introduces a geospatial embedding mechanism that transforms diverse geospatial data into embedding patches that are spatially aligned with image patches.<n>We show that the proposed framework outperforms existing pretrained geospatial foundation models in predicting disease prevalence.
arXiv Detail & Related papers (2026-01-26T22:45:28Z) - Dynamic Double Space Tower [4.553359878415195]
We study a brand-new approach to replace the attention mechanism in order to enhance the reasoning ability of the model.<n>Specifically, we propose a dynamic bidirectional spatial tower, which is divided into four layers to observe the image according to the principle of human gestalt vision.<n>This naturally provides a powerful structural prior for the spatial organization between entities, enabling the model to no longer blindly search for relationships between pixels.
arXiv Detail & Related papers (2025-06-13T01:27:45Z) - Scale-wise Bidirectional Alignment Network for Referring Remote Sensing Image Segmentation [12.893224628061516]
The goal of referring remote sensing image segmentation (RRSIS) is to extract specific pixel-level regions within an aerial image via a natural language expression.<n>We propose an innovative framework called Scale-wise Bidirectional Alignment Network (SBANet) to address these challenges.<n>Our proposed method achieves superior performance in comparison to previous state-of-the-art methods on the RRSIS-D and RefSegRS datasets.
arXiv Detail & Related papers (2025-01-01T14:24:04Z) - Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression.<n>To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM)<n>To further forster the research of RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z) - REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models [67.55362046790512]
Vision-language models lack the ability to correctly reason over spatial relationships.
We develop the REVISION framework which improves spatial fidelity in vision-language models.
Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware models.
arXiv Detail & Related papers (2024-08-05T04:51:46Z) - SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form
Layout-to-Image Generation [68.42476385214785]
We propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance.
SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works.
We also propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms.
arXiv Detail & Related papers (2023-08-20T04:09:12Z) - LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts [107.11267074981905]
We propose a semantically controllable layout-AWare diffusion model, termed LAW-Diffusion.
We show that LAW-Diffusion yields the state-of-the-art generative performance, especially with coherent object relations.
arXiv Detail & Related papers (2023-08-13T08:06:18Z) - Benchmarking Spatial Relationships in Text-to-Image Generation [102.62422723894232]
We investigate the ability of text-to-image models to generate correct spatial relationships among objects.
We present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image.
Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them.
arXiv Detail & Related papers (2022-12-20T06:03:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.