Related papers: Spatial Context Improves the Integration of Text with Remote Sensing for Mapping Environmental Variables

URL: http://arxiv.org/abs/2601.08750v1
Date: Tue, 13 Jan 2026 17:27:16 GMT
Title: Spatial Context Improves the Integration of Text with Remote Sensing for Mapping Environmental Variables
Authors: Valerie Zermatten, Chiara Vanalli, Gencer Sumbul, Diego Marcos, Devis Tuia
Abstract summary: We propose an attention-based approach that combines aerial imagery and geolocated text within a spatial neighbourhood. Our model is evaluated on the task of predicting 103 environmental variables from the SWECO25 data cube.
Score: 19.670023742796136
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Recent developments in natural language processing highlight text as an emerging data source for ecology. Textual resources carry unique information that can be used in complementarity with geospatial data sources, thus providing insights at the local scale into environmental conditions and properties hidden from more traditional data sources. Leveraging textual information in a spatial context presents several challenges. First, the contribution of textual data remains poorly defined in an ecological context, and it is unclear for which tasks it should be incorporated. Unlike ubiquitous satellite imagery or environmental covariates, the availability of textual data is sparse and irregular; its integration with geospatial data is not straightforward. In response to these challenges, this work proposes an attention-based approach that combines aerial imagery and geolocated text within a spatial neighbourhood, i.e. integrating contributions from several nearby observations. Our approach combines vision and text representations with a geolocation encoding, with an attention-based module that dynamically selects spatial neighbours that are useful for predictive tasks. The proposed approach is applied to the EcoWikiRS dataset, which combines high-resolution aerial imagery with sentences extracted from Wikipedia describing local environmental conditions across Switzerland. Our model is evaluated on the task of predicting 103 environmental variables from the SWECO25 data cube. Our approach consistently outperforms single-location or unimodal, i.e. image-only or text-only, baselines. When analysing variables by thematic groups, results show a significant improvement in performance for climatic, edaphic, population and land use/land cover variables, underscoring the benefit of including the spatial context when combining text and image data.
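
The idea of weighting nearby observations with attention can be illustrated with a small sketch. This is not the authors' architecture: it assumes each neighbour already has a fused image/text feature vector, uses a hand-written sinusoidal encoding of the neighbour's offset in place of the paper's geolocation encoding, and omits any learned projections. The names `encode_offset` and `attend_over_neighbours`, the `scales` parameter, and the toy values are all hypothetical.

```python
import math

def encode_offset(dx, dy, scales=(1.0, 10.0)):
    """Sinusoidal encoding of a neighbour's offset from the centre (assumed form)."""
    enc = []
    for s in scales:
        for v in (dx, dy):
            enc.append(math.sin(v / s))
            enc.append(math.cos(v / s))
    return enc

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_over_neighbours(query, feats, offsets):
    """Pool neighbour features with scaled dot-product attention.

    Keys are the neighbour features concatenated with their offset encoding;
    values are the neighbour features themselves (no learned weights here).
    """
    keys = [f + encode_offset(dx, dy) for f, (dx, dy) in zip(feats, offsets)]
    d = len(keys[0])
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    pooled = [sum(w * f[i] for w, f in zip(weights, feats))
              for i in range(len(feats[0]))]
    return pooled, weights

# Toy example: the centre location plus two neighbours at growing distance.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]    # made-up fused features
offsets = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]  # made-up offsets from the centre
query = feats[0] + encode_offset(0.0, 0.0)      # the centre queries its neighbourhood
pooled, weights = attend_over_neighbours(query, feats, offsets)
```

In this toy setup the nearby, similar neighbour receives more weight than the distant, dissimilar one. A trained model would learn which neighbours matter rather than relying on raw similarity.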

Related papers

Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale [41.693908591580175]
We develop vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder. Our models achieve promising performance on the existing 2D and 3D benchmarks, and notably, exhibit effectiveness in open-vocabulary cross-domain generalization.
arXiv Detail & Related papers (2025-06-13T17:57:18Z)
Combining Observational Data and Language for Species Range Estimation [63.65684199946094]
We propose a novel approach combining millions of citizen science species observations with textual descriptions from Wikipedia. Our framework maps locations, species, and text descriptions into a common space, enabling zero-shot range estimation from textual descriptions. Our approach also acts as a strong prior when combined with observational data, resulting in more accurate range estimation with less data.
arXiv Detail & Related papers (2024-10-14T17:22:55Z)
Self-consistent Deep Geometric Learning for Heterogeneous Multi-source Spatial Point Data Prediction [10.646376827353551]
Multi-source spatial point data prediction is crucial in fields like environmental monitoring and natural resource management. Existing models in this area often fall short due to their domain-specific nature and lack a strategy for integrating information from various sources. We introduce an innovative multi-source spatial point data prediction framework that adeptly aligns information from varied sources without relying on ground truth labels.
arXiv Detail & Related papers (2024-06-30T16:13:13Z)
Into the Unknown: Generating Geospatial Descriptions for New Environments [18.736071151303726]
The Rendezvous task requires reasoning over allocentric spatial relationships. Using open-source descriptions paired with coordinates (e.g., Wikipedia) provides training data but suffers from limited spatially-oriented text. We propose a large-scale augmentation method for generating high-quality synthetic data for new environments.
arXiv Detail & Related papers (2024-06-28T14:56:21Z)
Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching [60.645802236700035]
Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets. We introduce GeoText-1652, a new natural language-guided geo-localization benchmark. This dataset is systematically constructed through an interactive human-computer process.
arXiv Detail & Related papers (2023-11-21T17:52:30Z)
FREE: The Foundational Semantic Recognition for Modeling Environmental Ecosystems [56.0640340392818]
We introduce a framework, FREE, that enables the use of varying features and available information to train a universal model. The core idea is to map available environmental data into a text space and then convert the traditional predictive modeling task in environmental science to a semantic recognition problem. Our evaluation on two societally important real-world applications, stream water temperature prediction and crop yield prediction, demonstrates the superiority of FREE over multiple baselines.
arXiv Detail & Related papers (2023-11-17T00:53:09Z)
Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search [61.24539128142504]
Text-based person search (TBPS) is a challenging task that aims to search pedestrian images with the same identity from an image gallery given a query text. Most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities. We propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels.
arXiv Detail & Related papers (2022-08-30T16:14:18Z)
SIRI: Spatial Relation Induced Network For Spatial Description Resolution [64.38872296406211]
We propose a novel relationship induced (SIRI) network for language-guided localization. We show that our method is around 24% better than the state-of-the-art method in terms of accuracy, measured by an 80-pixel radius. Our method also generalizes well on our proposed extended dataset collected using the same settings as Touchdown.
arXiv Detail & Related papers (2020-10-27T14:04:05Z)
Grounded Situation Recognition [56.18102368133022]
We introduce Grounded Situation Recognition (GSR), a task that requires producing structured semantic summaries of images. GSR presents important technical challenges: identifying semantic saliency, categorizing and localizing a large and diverse set of entities. We show initial findings on three exciting future directions enabled by our models: conditional querying, visual chaining, and grounded semantic aware image retrieval.
arXiv Detail & Related papers (2020-03-26T17:57:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
