AutoGEEval: A Multimodal and Automated Framework for Geospatial Code Generation on GEE with Large Language Models
- URL: http://arxiv.org/abs/2505.12900v1
- Date: Mon, 19 May 2025 09:35:58 GMT
- Title: AutoGEEval: A Multimodal and Automated Framework for Geospatial Code Generation on GEE with Large Language Models
- Authors: Shuyang Hou, Zhangxiao Shen, Huayi Wu, Jianyuan Liang, Haoyue Jiao, Yaxian Qing, Xiaopu Zhang, Xu Li, Zhipeng Gui, Xuefeng Guan, Longgang Xiang,
- Abstract summary: AutoGEEval is an evaluation framework for code generation tasks on the Google Earth Engine (GEE) platform powered by large language models (LLMs). Built upon the GEE Python API, AutoGEEval establishes a benchmark suite (AutoGEEval-Bench) comprising 1325 test cases that span 26 GEE data types. We evaluate 18 state-of-the-art LLMs, including general-purpose, reasoning-augmented, code-centric, and geoscience-specialized models, revealing their performance characteristics and potential optimization pathways in GEE code generation.
- Score: 2.115331311872418
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Geospatial code generation is emerging as a key direction in the integration of artificial intelligence and geoscientific analysis. However, there remains a lack of standardized tools for automatic evaluation in this domain. To address this gap, we propose AutoGEEval, the first multimodal, unit-level automated evaluation framework for geospatial code generation tasks on the Google Earth Engine (GEE) platform powered by large language models (LLMs). Built upon the GEE Python API, AutoGEEval establishes a benchmark suite (AutoGEEval-Bench) comprising 1325 test cases that span 26 GEE data types. The framework integrates both question generation and answer verification components to enable an end-to-end automated evaluation pipeline, from function invocation to execution validation. AutoGEEval supports multidimensional quantitative analysis of model outputs in terms of accuracy, resource consumption, execution efficiency, and error types. We evaluate 18 state-of-the-art LLMs, including general-purpose, reasoning-augmented, code-centric, and geoscience-specialized models, revealing their performance characteristics and potential optimization pathways in GEE code generation. This work provides a unified protocol and foundational resource for the development and assessment of geospatial code generation models, advancing the frontier of automated natural language to domain-specific code translation.
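The end-to-end pipeline described in the abstract (invoke the generated function, execute it, validate the result, and record accuracy, runtime, and error type) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the actual AutoGEEval implementation: the real framework targets the GEE Python API, whereas this stub executes ordinary Python snippets and treats a variable named `result` as the answer; both the harness and the tiny benchmark below are hypothetical.

```python
import time

def evaluate_case(generated_code: str, reference_output, namespace=None):
    """Execute one generated snippet and validate it against a reference answer.

    The snippet must assign its answer to a variable named `result`.
    Returns a record with pass/fail status, an error type (if any),
    and wall-clock runtime, mirroring the metrics named in the abstract.
    """
    ns = dict(namespace or {})
    record = {"passed": False, "error_type": None, "runtime_s": None}
    start = time.perf_counter()
    try:
        exec(generated_code, ns)                    # function invocation
        record["runtime_s"] = time.perf_counter() - start
        record["passed"] = ns.get("result") == reference_output  # execution validation
        if not record["passed"]:
            record["error_type"] = "wrong_answer"
    except Exception as exc:                        # classify failures by exception type
        record["runtime_s"] = time.perf_counter() - start
        record["error_type"] = type(exc).__name__
    return record

# Aggregate accuracy over a tiny benchmark of (code, expected) pairs.
bench = [
    ("result = sum(range(5))", 10),
    ("result = 1 / 0", None),                       # runtime-error case
]
records = [evaluate_case(code, ref) for code, ref in bench]
accuracy = sum(r["passed"] for r in records) / len(records)
```

A real harness would additionally sandbox execution and meter memory, but the record structure above is enough to aggregate per-model accuracy and error-type distributions.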
Related papers
- GeoJSEval: An Automated Evaluation Framework for Large Language Models on JavaScript-Based Geospatial Computation and Visualization Code Generation [8.019960494784039]
GeoJSEval is a multimodal, function-level automatic evaluation framework for LLMs in JavaScript-based code generation. It includes 432 function-level tasks and 2,071 structured test cases spanning five widely used JavaScript geospatial libraries and 25 mainstream geospatial data types. We conduct a comprehensive evaluation of 18 state-of-the-art LLMs using GeoJSEval, revealing significant performance disparities and bottlenecks in spatial semantic understanding, code reliability, and function invocation accuracy.
arXiv Detail & Related papers (2025-07-28T06:38:38Z) - ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [48.24550684610705]
ArtifactsBench is a framework for automated visual code generation evaluation. Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z) - AutoGEEval++: A Multi-Level and Multi-Geospatial-Modality Automated Evaluation Framework for Large Language Models in Geospatial Code Generation on Google Earth Engine [9.161127232765063]
This study presents AutoGEEval++, the first automated assessment system for large language models (LLMs) generating geospatial code on Google Earth Engine (GEE). Built on the GEE Python API, AutoGEEval++ features a benchmark dataset, AutoGEEval++-Bench, with 6,365 test cases across 26 data types and three task categories: unit, combo, and theme tests. Using AutoGEEval++, we evaluate 24 state-of-the-art LLMs, including general-purpose, reasoning-enhanced, code-centric, and geoscience-specific models.
arXiv Detail & Related papers (2025-06-12T05:42:37Z) - ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks [54.52092001110694]
ThinkGeo is a benchmark designed to evaluate tool-augmented agents on remote sensing tasks via structured tool use and multi-step planning. Inspired by tool-interaction paradigms, ThinkGeo includes human-curated queries spanning a wide range of real-world applications. Our analysis reveals notable disparities in tool accuracy and planning consistency across models.
arXiv Detail & Related papers (2025-05-29T17:59:38Z) - TrustGeoGen: Scalable and Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving [66.0201510984171]
We propose a scalable data engine called TrustGeoGen for problem generation. By formal verification, TrustGeoGen produces the GeoTrust-200K dataset with guaranteed modality integrity. Experiments reveal that state-of-the-art models achieve only 49.17% accuracy on GeoTrust-test.
arXiv Detail & Related papers (2025-04-22T10:45:23Z) - GEE-OPs: An Operator Knowledge Base for Geospatial Code Generation on the Google Earth Engine Platform Powered by Large Language Models [0.562479170374811]
We propose a framework for building an operator knowledge base tailored to the Google Earth Engine (GEE) JavaScript API. This framework consists of an operator syntax knowledge table, an operator relationship frequency table, an operator frequent pattern knowledge table, and an operator relationship chain knowledge table. We show that the framework achieves over 90% accuracy, recall, and F1 score in operator knowledge extraction.
arXiv Detail & Related papers (2024-12-07T08:50:24Z) - Geo-FuB: A Method for Constructing an Operator-Function Knowledge Base for Geospatial Code Generation Tasks Using Large Language Models [0.5242869847419834]
This study introduces a framework to construct such a knowledge base, leveraging geospatial script semantics.
An example knowledge base, Geo-FuB, built from 154,075 Google Earth Engine scripts, is available on GitHub.
arXiv Detail & Related papers (2024-10-28T12:50:27Z) - AICoderEval: Improving AI Domain Code Generation of Large Language Models [10.060988050644076]
We open-source the AICoderEval dataset to facilitate research in this area.
We propose CoderGen, an agent-based framework, to help LLMs generate codes related to real-world tasks.
We train a more powerful task-specific code generation model, named AICoder, which is fine-tuned from Llama-3 on AICoderEval.
arXiv Detail & Related papers (2024-06-07T07:45:38Z) - AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning [54.47116888545878]
AutoAct is an automatic agent learning framework for QA.
It does not rely on large-scale annotated data and synthetic planning trajectories from closed-source models.
arXiv Detail & Related papers (2024-01-10T16:57:24Z) - AutoBERT-Zero: Evolving BERT Backbone from Scratch [94.89102524181986]
We propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures.
We optimize both the search algorithm and evaluation of candidate models to boost the efficiency of our proposed OP-NAS.
Experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities in various downstream tasks.
arXiv Detail & Related papers (2021-07-15T16:46:01Z) - PAIRS AutoGeo: an Automated Machine Learning Framework for Massive Geospatial Data [7.742399489996169]
An automated machine learning framework for geospatial data, named PAIRS AutoGeo, is introduced on the IBM PAIRS Geoscope big data and analytics platform.
The framework gathers required data at the location coordinates, assembles the training data, performs a quality check, and trains multiple machine learning models for subsequent deployment.
This use case exemplifies how PAIRS AutoGeo enables users to leverage machine learning without extensive geospatial expertise.
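The four steps the PAIRS AutoGeo abstract describes (gather data at coordinates, assemble training data, quality-check it, train and select among models) can be sketched as follows. This is a hypothetical stand-in, not the IBM PAIRS API: the feature lookups, quality rule, and candidate models are all invented for illustration.

```python
def gather_features(coords):
    """Stand-in for raster lookups at each (lat, lon) coordinate."""
    return [[lat + lon, abs(lat - lon)] for lat, lon in coords]

def quality_check(rows, labels):
    """Drop samples with missing feature values or labels."""
    kept = [(r, y) for r, y in zip(rows, labels)
            if y is not None and all(v is not None for v in r)]
    return [r for r, _ in kept], [y for _, y in kept]

def train_and_select(rows, labels, models):
    """Score each candidate model (a callable row -> prediction)
    and return the one with the lowest training MSE."""
    def mse(model):
        return sum((model(r) - y) ** 2 for r, y in zip(rows, labels)) / len(rows)
    return min(models, key=mse)

# Assemble a tiny training set; the label is the sum of the two features,
# so the sum-of-features model should win the selection.
coords = [(1.0, 2.0), (3.0, 4.0), (0.0, 5.0)]
rows = gather_features(coords)
labels = [sum(r) for r in rows]
# The quality check drops the deliberately corrupted extra sample.
rows, labels = quality_check(rows + [[None, 1.0]], labels + [2.0])

mean_model = lambda r, m=sum(labels) / len(labels): m  # constant baseline
sum_model = lambda r: sum(r)
best = train_and_select(rows, labels, [mean_model, sum_model])
```

In the real framework each step would be backed by PAIRS raster queries and full ML libraries, but the pipeline shape (gather, assemble, check, train, select) is the same.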
arXiv Detail & Related papers (2020-12-12T21:12:41Z) - Few-shot Natural Language Generation for Task-Oriented Dialog [113.07438787659859]
We present FewShotWoz, the first NLG benchmark to simulate the few-shot learning setting in task-oriented dialog systems.
We develop the SC-GPT model, which is pre-trained on a large annotated NLG corpus to acquire controllable generation ability.
Experiments on FewShotWoz and the large Multi-Domain-WOZ datasets show that the proposed SC-GPT significantly outperforms existing methods.
arXiv Detail & Related papers (2020-02-27T18:48:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.