AutoGEEval++: A Multi-Level and Multi-Geospatial-Modality Automated Evaluation Framework for Large Language Models in Geospatial Code Generation on Google Earth Engine
- URL: http://arxiv.org/abs/2506.10365v1
- Date: Thu, 12 Jun 2025 05:42:37 GMT
- Title: AutoGEEval++: A Multi-Level and Multi-Geospatial-Modality Automated Evaluation Framework for Large Language Models in Geospatial Code Generation on Google Earth Engine
- Authors: Shuyang Hou, Zhangxiao Shen, Huayi Wu, Haoyue Jiao, Ziqi Liu, Lutong Xie, Chang Liu, Jianyuan Liang, Yaxian Qing, Xiaopu Zhang, Dehua Peng, Zhipeng Gui, Xuefeng Guan,
- Abstract summary: This study presents AutoGEEval++, the first automated assessment system for large language models (LLMs) generating geospatial code on Google Earth Engine (GEE). Built on the GEE Python API, AutoGEEval++ features a benchmark dataset, AutoGEEval++-Bench, with 6,365 test cases across 26 data types and three task categories: unit, combo, and theme tests. Using AutoGEEval++, we evaluate 24 state-of-the-art LLMs, including general-purpose, reasoning-enhanced, code-centric, and geoscience-specific models.
- Score: 9.161127232765063
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Geospatial code generation is becoming a key frontier in integrating artificial intelligence with geo-scientific analysis, yet standardised automated evaluation tools for this task remain absent. This study presents AutoGEEval++, an enhanced framework building on AutoGEEval, and the first automated assessment system for large language models (LLMs) generating geospatial code on Google Earth Engine (GEE). It supports diverse data modalities and varying task complexities. Built on the GEE Python API, AutoGEEval++ features a benchmark dataset, AutoGEEval++-Bench, with 6,365 test cases across 26 data types and three task categories: unit, combo, and theme tests. It includes a submission programme and a judge module to realise an end-to-end automated evaluation pipeline from code generation to execution-based validation. The framework adopts multi-dimensional metrics (accuracy, resource usage, run-time efficiency, and error types), balancing hallucination control and efficiency, and enabling boundary testing and error pattern analysis. Using AutoGEEval++, we evaluate 24 state-of-the-art LLMs (as of June 2025), including general-purpose, reasoning-enhanced, code-centric, and geoscience-specific models. Results reveal clear performance, stability, and error differences across task types, model designs, and deployment settings, confirming AutoGEEval++'s practical value and scalability in vertical-domain code generation. This work establishes the first standardised evaluation protocol and foundational benchmark for GEE-based LLM code generation, providing a unified basis for performance comparison and a methodological framework for systematic, domain-specific code evaluation.
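The abstract describes an end-to-end pipeline in which a judge module executes generated GEE Python code and validates it against expected outputs, recording run time and error types. The sketch below is a minimal illustration of what such an execution-based judge could look like; the function name, result fields, and strict-equality comparison are assumptions for illustration, not the authors' actual implementation, and a real harness would need per-data-type output comparison and OS-level resource accounting.

```python
import json
import subprocess
import sys
import time


def judge_submission(script_path: str, expected_json_path: str, timeout_s: int = 300) -> dict:
    """Execute one generated GEE Python script and score it on the metric axes
    named in the abstract: accuracy, run-time efficiency, and error type.
    Resource usage would additionally need OS-level accounting (e.g. the
    `resource` module on Unix); omitted here to keep the sketch portable."""
    start = time.perf_counter()
    try:
        proc = subprocess.run(
            [sys.executable, script_path],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"passed": False, "error_type": "timeout", "runtime_s": timeout_s}
    runtime_s = time.perf_counter() - start

    if proc.returncode != 0:
        # Keep stderr so error patterns can be categorised downstream.
        return {"passed": False, "error_type": "runtime_error",
                "runtime_s": runtime_s, "stderr": proc.stderr}

    try:
        actual = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return {"passed": False, "error_type": "output_format_error",
                "runtime_s": runtime_s}

    with open(expected_json_path) as f:
        expected = json.load(f)

    # A production judge would compare per data type (images, feature
    # collections, numbers with tolerances); strict equality is a placeholder.
    passed = actual == expected
    return {"passed": passed,
            "error_type": None if passed else "wrong_answer",
            "runtime_s": runtime_s}
```

Aggregating these per-case results over the 6,365 test cases would then yield the accuracy, efficiency, and error-distribution views the framework reports.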
Related papers
- GeoJSEval: An Automated Evaluation Framework for Large Language Models on JavaScript-Based Geospatial Computation and Visualization Code Generation [8.019960494784039]
GeoJSEval is a multimodal, function-level automatic evaluation framework for LLMs in JavaScript-based code generation. It includes 432 function-level tasks and 2,071 structured test cases spanning five widely used JavaScript geospatial libraries and 25 mainstream geospatial data types. We conduct a comprehensive evaluation of 18 state-of-the-art LLMs using GeoJSEval, revealing significant performance disparities and bottlenecks in spatial semantic understanding, code reliability, and function invocation accuracy.
arXiv Detail & Related papers (2025-07-28T06:38:38Z) - ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [48.24550684610705]
ArtifactsBench is a framework for automated visual code generation evaluation. Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z) - ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks [54.52092001110694]
ThinkGeo is a benchmark designed to evaluate tool-augmented agents on remote sensing tasks via structured tool use and multi-step planning. Inspired by tool-interaction paradigms, ThinkGeo includes human-curated queries spanning a wide range of real-world applications. Our analysis reveals notable disparities in tool accuracy and planning consistency across models.
arXiv Detail & Related papers (2025-05-29T17:59:38Z) - AutoGEEval: A Multimodal and Automated Framework for Geospatial Code Generation on GEE with Large Language Models [2.115331311872418]
AutoGEEval is an evaluation framework for code generation tasks on the Google Earth Engine (GEE) platform powered by large language models (LLMs). Built upon the GEE Python API, AutoGEEval establishes a benchmark suite (AutoGEEval-Bench) comprising 1,325 test cases that span 26 GEE data types. We evaluate 18 state-of-the-art LLMs, including general-purpose, reasoning-augmented, code-centric, and geoscience-specialized models, revealing their performance characteristics and potential optimization pathways in GEE code generation.
arXiv Detail & Related papers (2025-05-19T09:35:58Z) - TrustGeoGen: Scalable and Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving [66.0201510984171]
We propose a scalable data engine called TrustGeoGen for problem generation. Through formal verification, TrustGeoGen produces the GeoTrust-200K dataset with guaranteed modality integrity. Experiments reveal that state-of-the-art models achieve only 49.17% accuracy on GeoTrust-test.
arXiv Detail & Related papers (2025-04-22T10:45:23Z) - OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z) - Geo-FuB: A Method for Constructing an Operator-Function Knowledge Base for Geospatial Code Generation Tasks Using Large Language Models [0.5242869847419834]
This study introduces a framework to construct such a knowledge base, leveraging geospatial script semantics.
An example knowledge base, Geo-FuB, built from 154,075 Google Earth Engine scripts, is available on GitHub.
arXiv Detail & Related papers (2024-10-28T12:50:27Z) - Is ChatGPT a game changer for geocoding -- a benchmark for geocoding address parsing techniques [3.759936323189418]
We introduce a benchmark dataset of low-quality address descriptions synthesized from human input patterns mined from the actual input logs of a production geocoding system.
The dataset covers 21 different input errors and variations and contains over 239,000 address records uniquely selected from streets across all 50 U.S. states and D.C.
We train the GPT-3 model to extract address components and gauge its performance against transformer-based and LSTM-based models.
arXiv Detail & Related papers (2023-10-22T17:03:56Z) - GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
GEMv2, the Generation, Evaluation, and Metrics benchmark, introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
arXiv Detail & Related papers (2022-06-22T17:52:30Z) - AutoBERT-Zero: Evolving BERT Backbone from Scratch [94.89102524181986]
We propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures.
We optimize both the search algorithm and evaluation of candidate models to boost the efficiency of our proposed OP-NAS.
Experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities in various downstream tasks.
arXiv Detail & Related papers (2021-07-15T16:46:01Z)