GeoSQL-Eval: First Evaluation of LLMs on PostGIS-Based NL2GeoSQL Queries
- URL: http://arxiv.org/abs/2509.25264v2
- Date: Thu, 02 Oct 2025 13:58:56 GMT
- Title: GeoSQL-Eval: First Evaluation of LLMs on PostGIS-Based NL2GeoSQL Queries
- Authors: Shuyang Hou, Haoyue Jiao, Ziqi Liu, Lutong Xie, Guanyu Chen, Shaowen Wu, Xuefeng Guan, Huayi Wu,
- Abstract summary: We present GeoSQL-Eval, the first end-to-end automated evaluation framework for PostGIS query generation. We also release a public GeoSQL-Eval leaderboard platform for continuous testing and global comparison.
- Score: 12.523407991161315
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) have shown strong performance in natural language to SQL (NL2SQL) tasks within general databases. However, extending to GeoSQL introduces additional complexity from spatial data types, function invocation, and coordinate systems, which greatly increases generation and execution difficulty. Existing benchmarks mainly target general SQL, and a systematic evaluation framework for GeoSQL is still lacking. To fill this gap, we present GeoSQL-Eval, the first end-to-end automated evaluation framework for PostGIS query generation, together with GeoSQL-Bench, a benchmark for assessing LLM performance in NL2GeoSQL tasks. GeoSQL-Bench defines three task categories (conceptual understanding, syntax-level SQL generation, and schema retrieval) comprising 14,178 instances, 340 PostGIS functions, and 82 thematic databases. GeoSQL-Eval is grounded in Webb's Depth of Knowledge (DOK) model, covering four cognitive dimensions, five capability levels, and twenty task types to establish a comprehensive process from knowledge acquisition and syntax generation to semantic alignment, execution accuracy, and robustness. We evaluate 24 representative models across six categories and apply the entropy weight method with statistical analyses to uncover performance differences, common error patterns, and resource usage. Finally, we release a public GeoSQL-Eval leaderboard platform for continuous testing and global comparison. This work extends the NL2GeoSQL paradigm and provides a standardized, interpretable, and extensible framework for evaluating LLMs in spatial database contexts, offering valuable references for geospatial information science and related applications.
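The entropy weight method named in the abstract derives a weight for each evaluation criterion from how much model scores diverge on it: criteria on which all models score alike carry no discriminating power and get weight near zero. A minimal sketch in Python (the score matrix below is invented for illustration, not taken from the paper):

```python
import math

def entropy_weights(scores):
    """Entropy weight method: criteria whose scores vary more across
    models carry more weight in the aggregate ranking.

    scores: one row per model; each row holds non-negative scores on
    the same criteria (e.g. syntax, execution accuracy, robustness).
    Returns one weight per criterion, summing to 1.
    """
    m = len(scores)              # number of models
    n = len(scores[0])           # number of criteria
    k = 1.0 / math.log(m)        # normaliser so entropy lies in [0, 1]
    divergences = []
    for j in range(n):
        col = [row[j] for row in scores]
        total = sum(col)
        # Normalised shares of criterion j across models.
        p = [v / total for v in col]
        # Shannon entropy of the shares (0 * log 0 treated as 0).
        e = -k * sum(v * math.log(v) for v in p if v > 0)
        divergences.append(1.0 - e)  # low entropy -> high divergence
    s = sum(divergences)
    return [d / s for d in divergences]

# Hypothetical scores for three models on three criteria.
scores = [[0.9, 0.4, 0.7],
          [0.8, 0.9, 0.7],
          [0.7, 0.1, 0.7]]
w = entropy_weights(scores)
```

With these numbers the second criterion, where models differ most, dominates the weights, and the third, where all models tie, contributes essentially nothing.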
Related papers
- Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning [52.075928878249066]
Vision-language models (VLMs) often struggle with geometric reasoning due to their limited perception of fundamental diagram elements. We introduce GeoPerceive, a benchmark comprising diagram instances paired with domain-specific language representations. We propose GeoDPO, a translator reinforcement learning framework.
arXiv Detail & Related papers (2026-02-26T07:28:04Z)
- FloodSQL-Bench: A Retrieval-Augmented Benchmark for Geospatially-Grounded Text-to-SQL [4.973502845481286]
FloodSQL-Bench is a benchmark for the flood management domain that integrates heterogeneous datasets through key-based, spatial, and hybrid joins. The benchmark captures realistic flood-related information needs by combining social, infrastructural, and hazard data layers.
arXiv Detail & Related papers (2025-12-12T23:25:00Z)
- GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI [52.13138825802668]
GeoFMs are transforming Earth Observation, but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation. Code, data, and the leaderboard for GEO-Bench-2 are publicly released under a permissive license.
arXiv Detail & Related papers (2025-11-19T17:45:02Z)
- From Questions to Queries: An AI-powered Multi-Agent Framework for Spatial Text-to-SQL [0.4499833362998488]
Single-agent approaches often struggle with the semantic and syntactic complexities of spatial queries. We propose a multi-agent framework designed to accurately translate natural language questions into spatial queries. We evaluate our system using both the non-spatial KaggleDBQA benchmark and a new, comprehensive SpatialQA benchmark.
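To make the task concrete, here is a hypothetical input/output pair of the kind a spatial text-to-SQL system targets: a natural-language question and a PostGIS query using real spatial functions (ST_Transform, ST_DWithin). The table and column names are invented for illustration:

```python
# A hypothetical spatial text-to-SQL pair. The "schools" and "rivers"
# tables with a "geom" column are assumed, not from any cited benchmark.
question = "Which schools lie within 500 m of the River Thames?"

sql = """
SELECT s.name
FROM schools AS s
JOIN rivers AS r ON r.name = 'River Thames'
WHERE ST_DWithin(
    ST_Transform(s.geom, 3857),   -- reproject to a metric CRS
    ST_Transform(r.geom, 3857),   -- so the distance is in metres
    500
);
"""
```

The reprojection step illustrates why such queries are harder than general SQL: ST_DWithin on geometry interprets the distance in the units of the coordinate system, so geographic (degree-based) data must first be transformed to a metric CRS.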
arXiv Detail & Related papers (2025-10-23T22:58:17Z)
- GeoJSON Agents: A Multi-Agent LLM Architecture for Geospatial Analysis-Function Calling vs Code Generation [7.335354895959486]
This study is the first to introduce an LLM multi-agent framework for GeoJSON data. The architecture consists of three components: task parsing, agent collaboration, and result integration.
arXiv Detail & Related papers (2025-09-10T03:43:46Z)
- GeoAnalystBench: A GeoAI benchmark for assessing large language models for spatial analysis workflow and code generation [32.22754624992446]
We present GeoAnalystBench, a benchmark of 50 Python-based tasks derived from real-world geospatial problems. Using this benchmark, we assess both proprietary and open-source models. Results reveal a clear gap: proprietary models such as ChatGPT-4o-mini achieve 95% validity and stronger code alignment.
arXiv Detail & Related papers (2025-09-07T00:51:57Z)
- GeoJSEval: An Automated Evaluation Framework for Large Language Models on JavaScript-Based Geospatial Computation and Visualization Code Generation [8.019960494784039]
GeoJSEval is a multimodal, function-level automatic evaluation framework for LLMs in JavaScript-based code generation. It includes 432 function-level tasks and 2,071 structured test cases spanning five widely used JavaScript geospatial libraries and 25 mainstream geospatial data types. We conduct a comprehensive evaluation of 18 state-of-the-art LLMs using GeoJSEval, revealing significant performance disparities and bottlenecks in spatial semantic understanding, code reliability, and function invocation accuracy.
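Function-level evaluation of this kind runs a candidate (model-generated) function against structured test cases and scores it by pass rate. A minimal sketch of that harness style in Python (the haversine task, its test cases, and tolerances are invented here; GeoJSEval itself targets JavaScript libraries):

```python
import math

def candidate_haversine_km(lon1, lat1, lon2, lat2):
    """Stand-in for a model-generated function: great-circle distance in km."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Structured test cases: arguments, expected output, optional tolerance.
test_cases = [
    {"args": (0, 0, 0, 0), "expected": 0.0},
    {"args": (0, 0, 0, 1), "expected": 111.19, "tol": 0.5},    # ~1 deg of latitude
    {"args": (0, 0, 90, 0), "expected": 10007.5, "tol": 5.0},  # quarter circumference
]

def pass_rate(fn, cases):
    """Fraction of cases the candidate matches within tolerance."""
    passed = 0
    for case in cases:
        got = fn(*case["args"])
        if abs(got - case["expected"]) <= case.get("tol", 1e-6):
            passed += 1
    return passed / len(cases)

rate = pass_rate(candidate_haversine_km, test_cases)
```

Execution-based scoring of this form checks behaviour rather than surface similarity of the generated code, which is why such frameworks report bottlenecks like function-invocation accuracy separately from syntax.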
arXiv Detail & Related papers (2025-07-28T06:38:38Z)
- AutoGEEval++: A Multi-Level and Multi-Geospatial-Modality Automated Evaluation Framework for Large Language Models in Geospatial Code Generation on Google Earth Engine [9.161127232765063]
This study presents AutoGEEval++, the first automated assessment system for large language models (LLMs) generating geospatial code on Google Earth Engine (GEE). Built on the GEE Python API, AutoGEEval++ features a benchmark dataset, AutoGEEval++-Bench, with 6,365 test cases across 26 data types and three task categories: unit, combo, and theme tests. Using AutoGEEval++, we evaluate 24 state-of-the-art LLMs, including general-purpose, reasoning-enhanced, code-centric, and geoscience-specific models.
arXiv Detail & Related papers (2025-06-12T05:42:37Z)
- GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks [84.86699025256705]
We present GEOBench-VLM, a benchmark specifically designed to evaluate Vision-Language Models (VLMs) on geospatial tasks. Our benchmark features over 10,000 manually verified instructions spanning diverse visual conditions, object types, and scales. We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges.
arXiv Detail & Related papers (2024-11-28T18:59:56Z)
- Towards Evaluating Large Language Models for Graph Query Generation [49.49881799107061]
Large Language Models (LLMs) are revolutionizing the landscape of Generative Artificial Intelligence (GenAI).
This paper presents a comparative study addressing the challenge of generating queries in a powerful language for interacting with graph databases, using open-access LLMs.
Our empirical analysis of query generation accuracy reveals that Claude Sonnet 3.5 outperforms its counterparts in this specific domain.
arXiv Detail & Related papers (2024-11-13T09:11:56Z)
- An LLM Agent for Automatic Geospatial Data Analysis [5.842462214442362]
Large language models (LLMs) are being used in data science code generation tasks.
Their application to geospatial data processing is challenging due to difficulties in incorporating complex data structures and spatial constraints.
We introduce GeoAgent, a new interactive framework designed to help LLMs handle geospatial data processing more effectively.
arXiv Detail & Related papers (2024-10-24T14:47:25Z)
- Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework [51.26566634946208]
We introduce smileGeo, a novel visual geo-localization framework.
By inter-agent communication, smileGeo integrates the inherent knowledge of these agents with additional retrieved information.
Results show that our approach significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T03:31:30Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- GeoLLM: Extracting Geospatial Knowledge from Large Language Models [49.20315582673223]
We present GeoLLM, a novel method that can effectively extract geospatial knowledge from large language models.
We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods.
Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe.
arXiv Detail & Related papers (2023-10-10T00:03:23Z)
- MGeo: Multi-Modal Geographic Pre-Training Method [49.78466122982627]
We propose a novel query-POI matching method, the Multi-modal Geographic language model (MGeo).
MGeo represents GC as a new modality and is able to fully extract multi-modal correlations for accurate query-POI matching.
Our proposed multi-modal pre-training method can significantly improve the query-POI matching capability of generic PTMs.
arXiv Detail & Related papers (2023-01-11T03:05:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.