Remote Sensing-Oriented World Model
- URL: http://arxiv.org/abs/2509.17808v2
- Date: Sat, 27 Sep 2025 15:53:26 GMT
- Title: Remote Sensing-Oriented World Model
- Authors: Yuxi Lu, Biao Wu, Zhidong Li, Kunqi Li, Chenya Huang, Huacan Wang, Qizhen Lan, Ronghao Chen, Ling Chen, Bin Liang
- Abstract summary: World models have shown potential in artificial intelligence by predicting and reasoning about world states beyond direct observations. Existing approaches are predominantly evaluated in synthetic environments or constrained scene settings. This paper bridges these gaps by introducing the first framework for world modeling in remote sensing.
- Score: 14.021235530589246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: World models have shown potential in artificial intelligence by predicting and reasoning about world states beyond direct observations. However, existing approaches are predominantly evaluated in synthetic environments or constrained scene settings, limiting their validation in real-world contexts with broad spatial coverage and complex semantics. Meanwhile, remote sensing applications urgently require spatial reasoning capabilities for disaster response and urban planning. This paper bridges these gaps by introducing the first framework for world modeling in remote sensing. We formulate remote sensing world modeling as direction-conditioned spatial extrapolation, where models generate semantically consistent adjacent image tiles given a central observation and directional instruction. To enable rigorous evaluation, we develop RSWISE (Remote Sensing World-Image Spatial Evaluation), a benchmark containing 1,600 evaluation tasks across four scenarios: general, flood, urban, and rural. RSWISE combines visual fidelity assessment with instruction compliance evaluation using GPT-4o as a semantic judge, ensuring models genuinely perform spatial reasoning rather than simple replication. Afterwards, we present RemoteBAGEL, a unified multimodal model fine-tuned on remote sensing data for spatial extrapolation tasks. Extensive experiments demonstrate that RemoteBAGEL consistently outperforms state-of-the-art baselines on RSWISE.
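The paper formulates remote sensing world modeling as direction-conditioned spatial extrapolation: given a central tile and a directional instruction, the model must generate the semantically consistent adjacent tile. The geometry of that task can be sketched with a minimal, hypothetical data structure; the tile-grid indexing and direction names below are assumptions for illustration, not the authors' implementation:

```python
from dataclasses import dataclass

# Hypothetical mapping from directional instructions to (row, col) offsets
# on a north-up tile grid; the paper does not specify this encoding.
DIRECTIONS = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

@dataclass
class ExtrapolationTask:
    center_tile: tuple[int, int]  # (row, col) of the observed central tile
    direction: str                # instruction, e.g. "east"

    def target_tile(self) -> tuple[int, int]:
        """Index of the adjacent tile the model is asked to generate."""
        dr, dc = DIRECTIONS[self.direction]
        r, c = self.center_tile
        return (r + dr, c + dc)

task = ExtrapolationTask(center_tile=(5, 5), direction="east")
print(task.target_tile())  # (5, 6)
```

Under this framing, evaluation compares the generated tile against the ground-truth tile at `target_tile()`, combining visual fidelity with instruction compliance (the paper uses GPT-4o as the semantic judge for the latter).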
Related papers
- RISE-Video: Can Video Generators Decode Implicit World Rules? [71.92434352963427]
We present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories. We propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment.
arXiv Detail & Related papers (2026-02-05T18:36:10Z) - RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning [61.84363374647606]
Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. These descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. We propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding.
arXiv Detail & Related papers (2026-01-29T12:35:57Z) - SmallWorlds: Assessing Dynamics Understanding of World Models in Isolated Environments [15.243547292947397]
We introduce the SmallWorld Benchmark, a testbed designed to assess world model capability under isolated and precisely controlled dynamics. We conduct comprehensive experiments in the fully observable state space on representative architectures including Recurrent State Space Model, Transformer, Diffusion model, and Neural ODE. The experimental results reveal how effectively these models capture environment structure and how their predictions deteriorate over extended rollouts.
arXiv Detail & Related papers (2025-11-28T18:56:02Z) - IndustryNav: Exploring Spatial Reasoning of Embodied Agents in Dynamic Industrial Navigation [56.43007596544299]
IndustryNav is the first dynamic industrial navigation benchmark for active spatial reasoning. A study of nine state-of-the-art Visual Large Language Models reveals that closed-source models maintain a consistent advantage.
arXiv Detail & Related papers (2025-11-21T16:48:49Z) - OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios [39.58602686069029]
We introduce OmniGround, a comprehensive benchmark with 3,475 videos spanning 81 categories and complex real-world queries. We also introduce DeepSTG, a systematic evaluation framework quantifying dataset quality across four complementary dimensions. Experiments demonstrate PG-TAF achieves 25.6% and 35.6% improvements in m_tIoU and m_vIoU with consistent gains across four benchmarks.
arXiv Detail & Related papers (2025-11-21T04:23:04Z) - FUSAR-KLIP: Towards Multimodal Foundation Models for Remote Sensing [16.948824707021412]
Cross-modal artificial intelligence has garnered widespread attention in recent years, achieving significant progress in the study of natural images. Existing methods are mostly designed for RGB imagery, leaving a significant gap in modeling synthetic aperture radar (SAR) imagery. This paper proposes FUSAR-KLIP, the first universal SAR multimodal foundational model, along with reusable data and evaluation baselines.
arXiv Detail & Related papers (2025-09-28T15:03:25Z) - Towards Scalable and Generalizable Earth Observation Data Mining via Foundation Model Composition [0.0]
We investigate whether foundation models pretrained on remote sensing and general vision datasets can be effectively combined to improve performance. The results show that feature-level ensembling of smaller pretrained models can match or exceed the performance of much larger models. The study highlights the potential of applying knowledge distillation to transfer the strengths of ensembles into more compact models.
arXiv Detail & Related papers (2025-06-25T07:02:42Z) - TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation [65.74990259650984]
We introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery. Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism. TerraFM achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench.
arXiv Detail & Related papers (2025-06-06T17:59:50Z) - ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks [64.86209459039313]
ThinkGeo is an agentic benchmark designed to evaluate tool-augmented agents on remote sensing tasks via structured tool use and multi-step planning. We implement a ReAct-style interaction loop and evaluate both open and closed-source LLMs on 436 structured agentic tasks with 1,773 expert-verified reasoning steps. Our analysis reveals notable disparities in tool accuracy and planning consistency across models.
arXiv Detail & Related papers (2025-05-29T17:59:38Z) - LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation [45.02469804709771]
We propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling. We show that LaDi-WM significantly enhances policy performance by 27.9% on the LIBERO-LONG benchmark and 20% on the real-world scenario.
arXiv Detail & Related papers (2025-05-13T04:42:14Z) - Hearing Anywhere in Any Environment [33.566252963174556]
We present xRIR, a framework for cross-room Room Impulse Response (RIR) prediction. The core of our generalizable approach lies in combining a geometric feature extractor, which captures spatial context from panorama depth images, with an RIR encoder that extracts detailed acoustic features from only a few reference RIR samples. Experiments show that our method strongly outperforms a series of baselines. Furthermore, we successfully perform sim-to-real transfer by evaluating our model on four real-world environments, demonstrating the generalizability of our approach and the realism of our dataset.
arXiv Detail & Related papers (2025-04-14T22:37:52Z) - OpenEarthSensing: Large-Scale Fine-Grained Benchmark for Open-World Remote Sensing [57.050679160659705]
We introduce OpenEarthSensing (OES), a large-scale fine-grained benchmark for open-world remote sensing. OES includes 189 scene and object categories, covering the vast majority of potential semantic shifts that may occur in the real world.
arXiv Detail & Related papers (2025-02-28T02:49:52Z) - Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning [4.422649561583363]
We present a novel benchmark for assessing spatial reasoning in language models (LMs).
It is grounded in realistic 3D simulation data, offering a series of diverse room layouts with various objects and their spatial relationships.
A key contribution is our logic-based consistency-checking tool, which enables the assessment of multiple plausible solutions.
arXiv Detail & Related papers (2024-05-23T21:22:00Z) - Spatial Language Understanding for Object Search in Partially Observed Cityscale Environments [21.528770932332474]
We introduce the spatial language observation space and formulate a model under the framework of Partially Observable Markov Decision Processes (POMDPs).
We propose a convolutional neural network model that learns to predict the language provider's relative frame of reference (FoR) given environment context.
We demonstrate the generalizability of our FoR prediction model and object search system through cross-validation over areas of five cities, each with a 40,000 m² footprint.
arXiv Detail & Related papers (2020-12-04T16:27:59Z)
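The frame-of-reference (FoR) idea in the last entry can be illustrated with a small, hedged sketch: once a model predicts the language provider's FoR as a heading angle, a relative relation such as "in front of the landmark" resolves to a direction in map coordinates. The function and relation names below are hypothetical, not the paper's interface:

```python
import math

# Hypothetical angular offsets for relative spatial relations,
# measured counterclockwise from the predicted FoR heading.
RELATION_OFFSETS = {
    "front": 0.0,
    "left": math.pi / 2,
    "right": -math.pi / 2,
    "behind": math.pi,
}

def resolve_relation(landmark_xy, for_angle_rad, relation, dist=1.0):
    """Return a map-space point `dist` away from the landmark in the
    direction implied by the relation under the predicted FoR."""
    theta = for_angle_rad + RELATION_OFFSETS[relation]
    x, y = landmark_xy
    return (x + dist * math.cos(theta), y + dist * math.sin(theta))

# With a predicted FoR heading of 0 rad (facing +x), "front of the landmark"
# resolves to one unit along the x-axis.
print(resolve_relation((0.0, 0.0), 0.0, "front"))  # (1.0, 0.0)
```

The paper's contribution is predicting the FoR angle itself from environment context with a CNN; this sketch only shows how such a prediction would be consumed downstream.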
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.