Towards Unified Vision Language Models for Forest Ecological Analysis in Earth Observation
- URL: http://arxiv.org/abs/2511.16853v1
- Date: Thu, 20 Nov 2025 23:28:30 GMT
- Title: Towards Unified Vision Language Models for Forest Ecological Analysis in Earth Observation
- Authors: Xizhe Xue, Xiao Xiang Zhu
- Abstract summary: REO-Instruct is the first unified benchmark designed for both descriptive and regression tasks in Earth Observation (EO). The dataset integrates co-registered Sentinel-2 and ALOS-2 imagery with structured textual annotations generated and validated through a hybrid human-AI pipeline.
- Score: 10.566310652119148
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in vision language models (VLMs) has enabled remarkable perception and reasoning capabilities, yet their potential for scientific regression in Earth Observation (EO) remains largely unexplored. Existing EO datasets mainly emphasize semantic understanding tasks such as captioning or classification, lacking benchmarks that align multimodal perception with measurable biophysical variables. To fill this gap, we present REO-Instruct, the first unified benchmark designed for both descriptive and regression tasks in EO. REO-Instruct establishes a cognitively interpretable logic chain in a forest ecological scenario (human activity, land-cover classification, ecological patch counting, above-ground biomass (AGB) regression), bridging qualitative understanding and quantitative prediction. The dataset integrates co-registered Sentinel-2 and ALOS-2 imagery with structured textual annotations generated and validated through a hybrid human-AI pipeline. Comprehensive evaluation protocols and baseline results across generic VLMs reveal that current models struggle with numeric reasoning, highlighting an essential challenge for scientific VLMs. REO-Instruct offers a standardized foundation for developing and assessing next-generation geospatial models capable of both description and scientific inference. The project page is publicly available at https://github.com/zhu-xlab/REO-Instruct.
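The abstract identifies numeric reasoning (scoring free-form VLM text against a measurable variable such as AGB) as the key failure mode of generic VLMs. As a rough sketch of what such an evaluation involves, the snippet below parses a number from a model's text answer and scores it with RMSE/MAE; the answer format, parsing regex, metric set, and function names are illustrative assumptions, not the official REO-Instruct evaluation protocol (see the linked repository for that).

```python
# Minimal sketch of scoring free-form VLM answers against ground-truth
# above-ground biomass (AGB) values. Record format, regex, and metrics
# are illustrative assumptions, not the official REO-Instruct protocol.
import math
import re

def parse_agb(answer: str) -> float | None:
    """Extract the first numeric value (e.g. '142.7 Mg/ha') from model text."""
    match = re.search(r"[-+]?\d+(?:\.\d+)?", answer)
    return float(match.group()) if match else None

def regression_scores(pairs: list[tuple[str, float]]) -> dict[str, float]:
    """Compute RMSE/MAE over (model_answer, ground_truth_agb) pairs,
    counting unparseable answers as failures."""
    errors, failures = [], 0
    for answer, truth in pairs:
        pred = parse_agb(answer)
        if pred is None:
            failures += 1
            continue
        errors.append(pred - truth)
    n = max(len(errors), 1)
    return {
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "mae": sum(abs(e) for e in errors) / n,
        "parse_failure_rate": failures / max(len(pairs), 1),
    }

# Toy usage with hypothetical model outputs:
print(regression_scores([
    ("The estimated AGB is about 120.5 Mg/ha.", 131.0),
    ("Dense forest; biomass roughly 95 Mg/ha.", 88.2),
    ("I cannot determine the biomass.", 104.3),  # parse failure
]))
```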
Related papers
- TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation [70.23578202012048]
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch. We propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. With the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction.
arXiv Detail & Related papers (2026-03-03T13:28:07Z)
- BotaCLIP: Contrastive Learning for Botany-Aware Representation of Earth Observation Data [5.63123057685834]
BotaCLIP is a lightweight framework for adapting a pre-trained foundation model to inject domain-specific knowledge. We show how BotaCLIP can inject expert knowledge into data-scarce settings, enabling frugal representation learning.
arXiv Detail & Related papers (2025-11-26T09:19:06Z)
- GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI [52.13138825802668]
GeoFMs are transforming Earth Observation, but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.
arXiv Detail & Related papers (2025-11-19T17:45:02Z)
- TOPO-Bench: An Open-Source Topological Mapping Evaluation Framework with Quantifiable Perceptual Aliasing [10.736029638634504]
We formalize topological consistency as the fundamental property of topological maps and show that localization accuracy provides an efficient surrogate metric. We propose the first quantitative measure of dataset ambiguity to enable fair comparisons across environments. All datasets, baselines, and evaluation tools are fully open-sourced to foster consistent and reproducible research in topological mapping.
arXiv Detail & Related papers (2025-10-05T08:58:08Z)
- Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents [49.3216026940601]
Earth observation is essential for understanding the states of the Earth system. Recent MLLMs have advanced EO research, but they still lack the capability to tackle complex tasks that require multi-step reasoning. We introduce Earth-Agent, the first agentic framework that unifies RGB and spectral EO data within an MCP-based tool ecosystem.
arXiv Detail & Related papers (2025-09-27T06:04:28Z)
- PRING: Rethinking Protein-Protein Interaction Prediction from Pairs to Graphs [88.98041407783502]
PRING is the first benchmark that evaluates protein-protein interaction prediction from a graph-level perspective. PRING curates a high-quality, multi-species PPI network dataset comprising 21,484 proteins and 186,818 interactions.
arXiv Detail & Related papers (2025-07-07T15:21:05Z)
- Adapting Vision-Language Models for Evaluating World Models [24.813041196394582]
We present UNIVERSE, a method for adapting Vision-language Evaluator for Rollouts in Simulated Environments under data and compute constraints. We conduct a large-scale study comparing full, partial, and parameter-efficient finetuning across task formats, context lengths, sampling strategies, and data compositions. The resulting unified evaluator matches the performance of task-specific baselines using a single checkpoint.
arXiv Detail & Related papers (2025-06-22T09:53:28Z)
- EarthMind: Leveraging Cross-Sensor Data for Advanced Earth Observation Interpretation with a Unified Multimodal LLM [103.7537991413311]
Earth Observation (EO) data analysis is vital for monitoring environmental and human dynamics. Recent Multimodal Large Language Models (MLLMs) show potential in EO understanding but remain restricted to single-sensor inputs. We propose EarthMind, a unified vision-language framework that handles both single- and cross-sensor inputs.
arXiv Detail & Related papers (2025-06-02T13:36:05Z)
- Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation [63.54377402784965]
We propose a Rewriting-driven AugMentation (RAM) paradigm for Vision-Language Navigation (VLN). Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners. Experiments on both discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method.
arXiv Detail & Related papers (2025-03-23T13:18:17Z)
- Regression in EO: Are VLMs Up to the Challenge? [18.343600857006763]
Vision Language Models (VLMs) have achieved remarkable success in perception and reasoning tasks. This paper systematically examines the challenges and opportunities of adapting VLMs for EO regression tasks.
arXiv Detail & Related papers (2025-02-19T20:27:54Z)
- REO-VLM: Transforming VLM to Meet Regression Challenges in Earth Observation [58.91579272882073]
This paper introduces a novel benchmark dataset, called REO-Instruct, to unify regression and generation tasks specifically for the Earth Observation domain. We develop REO-VLM, a groundbreaking model that seamlessly integrates regression capabilities with traditional generative functions.
arXiv Detail & Related papers (2024-12-21T11:17:15Z)
- A Biologically Interpretable Two-stage Deep Neural Network (BIT-DNN) for Vegetation Recognition From Hyperspectral Imagery [3.708283803668841]
This study proposes a novel interpretable deep learning model, a biologically interpretable two-stage deep neural network (BIT-DNN).
The proposed model has been compared with five state-of-the-art deep learning models.
arXiv Detail & Related papers (2020-04-19T15:58:19Z)