Related papers: GeoBenchX: Benchmarking LLMs in Agent Solving Multistep Geospatial Tasks

URL: http://arxiv.org/abs/2503.18129v2
Date: Wed, 22 Oct 2025 16:12:30 GMT
Title: GeoBenchX: Benchmarking LLMs in Agent Solving Multistep Geospatial Tasks
Authors: Varvara Krechetova, Denis Kochedykov
Abstract summary: This paper establishes a benchmark for evaluating tool-calling capabilities of large language models (LLMs). We assess eight commercial LLMs (Claude Sonnet 3.5 and 4, Claude Haiku 3.5, Gemini 2.0 Flash, Gemini 2.5 Pro Preview, GPT-4o, GPT-4.1 and o4-mini) using a simple tool-calling agent equipped with 23 geospatial functions. Results show o4-mini and Claude 3.5 Sonnet achieve the best overall performance; OpenAI's GPT-4.1, GPT-4o and Google's Gemini 2.5 Pro Preview do not fall far behind, but the last two are more efficient in identifying unsolvable tasks.
Score: 0.11458853556386796
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper establishes a benchmark for evaluating tool-calling capabilities of large language models (LLMs) on multi-step geospatial tasks relevant to commercial GIS practitioners. We assess eight commercial LLMs (Claude Sonnet 3.5 and 4, Claude Haiku 3.5, Gemini 2.0 Flash, Gemini 2.5 Pro Preview, GPT-4o, GPT-4.1 and o4-mini) using a simple tool-calling agent equipped with 23 geospatial functions. Our benchmark comprises tasks in four categories of increasing complexity, with both solvable and intentionally unsolvable tasks to test rejection accuracy. We develop an LLM-as-Judge evaluation framework to compare agent solutions against reference solutions. Results show o4-mini and Claude 3.5 Sonnet achieve the best overall performance; OpenAI's GPT-4.1, GPT-4o and Google's Gemini 2.5 Pro Preview do not fall far behind, but the last two are more efficient in identifying unsolvable tasks. Claude Sonnet 4, due to its preference to provide any solution rather than reject a task, proved to be less accurate. We observe significant differences in token usage, with Anthropic models consuming more tokens than competitors. Common errors include misunderstanding geometrical relationships, relying on outdated knowledge, and inefficient data manipulation. The resulting benchmark set, evaluation framework, and data generation pipeline are released as open-source resources (available at https://github.com/Solirinai/GeoBenchX), providing one more standardized method for the ongoing evaluation of LLMs for GeoAI.

Related papers

ConnectomeBench: Can LLMs Proofread the Connectome? [0.4999814847776097]
We introduce ConnectomeBench, a benchmark evaluating large language model (LLM) capabilities in three critical proofreading tasks. We evaluate proprietary multimodal LLMs including Claude 3.7/4 Sonnet, o4-mini, GPT-4.1, GPT-4o, as well as open source models like InternVL-3 and NVLM. Our results demonstrate that current models achieve surprisingly high performance in segment identification. While the best models still lag behind expert performance, they demonstrate promising capabilities that could eventually enable them to augment and potentially replace human proofreading in connectomics.
arXiv Detail & Related papers (2025-10-31T02:20:38Z)
Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
Evaluating the Use of LLMs for Documentation to Code Traceability [3.076436880934678]
Large Language Models can establish trace links between various software documentation and source code. We create two novel datasets from two open-source projects (Unity Catalog and Crawl4AI). Results show that the best-performing LLM achieves F1-scores of 79.4% and 80.4% across the two datasets.
arXiv Detail & Related papers (2025-06-19T16:18:53Z)
Guided Search Strategies in Non-Serializable Environments with Applications to Software Engineering Agents [31.651748374218446]
Large language models (LLMs) have recently achieved remarkable results in complex multi-step tasks. They often struggle to maintain consistent performance across multiple solution attempts.
arXiv Detail & Related papers (2025-05-19T18:50:15Z)
TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving [106.04001249574786]
TrustGeoGen is a data engine that generates formally verified geometric problems to establish a principled and trustworthy benchmark. Our engine integrates four key innovations: 1) Multimodal Alignment, which synchronizes the generation of diagrams, text, and step-by-step solutions; 2) Formal Verification, ensuring all reasoning paths are rule-compliant; 3) Connection Thinking, bridging formal deduction with human-like logical steps; and 4) our GeoExplore series algorithms, which produce diverse problem variants with multiple solutions and self-reflective backtracking.
arXiv Detail & Related papers (2025-04-22T10:45:23Z)
Teamwork makes the dream work: LLMs-Based Agents for GitHub README.MD Summarization [7.330697128881243]
We propose Metagente as a novel approach to amplify the synergy of various Large Language Models (LLMs). Metagente is a Multi-Agent framework based on a series of LLMs to self-optimize the system through evaluation, feedback, and cooperation among specialized agents. The performance gain compared to GitSum, the most relevant benchmark, ranges from 27.63% to 60.43%.
arXiv Detail & Related papers (2025-03-13T20:42:39Z)
Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they still struggle with problems requiring multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs. LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data. Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
Dspy-based Neural-Symbolic Pipeline to Enhance Spatial Reasoning in LLMs [29.735465300269993]
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they often struggle with spatial reasoning. This paper presents a novel neural-symbolic framework that enhances LLMs' spatial reasoning abilities through iterative feedback between LLMs and Answer Set Programming (ASP). We evaluate our approach on two benchmark datasets: StepGame and SparQA.
arXiv Detail & Related papers (2024-11-27T18:04:05Z)
A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models [11.087034068992653]
FAUN-Eval is a benchmark specifically designed to evaluate the Fine-grAined issUe solviNg capabilities of LLMs. It is constructed using a dataset curated from 30 well-known GitHub repositories. We evaluate ten LLMs with FAUN-Eval, including four closed-source and six open-source models.
arXiv Detail & Related papers (2024-11-27T03:25:44Z)
An LLM Agent for Automatic Geospatial Data Analysis [5.842462214442362]
Large language models (LLMs) are being used in data science code generation tasks. Their application to geospatial data processing is challenging due to difficulties in incorporating complex data structures and spatial constraints. We introduce GeoAgent, a new interactive framework designed to help LLMs handle geospatial data processing more effectively.
arXiv Detail & Related papers (2024-10-24T14:47:25Z)
GraphTeam: Facilitating Large Language Model-based Graph Analysis via Multi-Agent Collaboration [43.96008600046952]
GraphTeam consists of five LLM-based agents from three modules, and the agents with different specialities can collaborate to address complex problems. Experiments on six graph analysis benchmarks demonstrate that GraphTeam achieves state-of-the-art performance with an average 25.85% improvement over the best baseline in terms of accuracy.
arXiv Detail & Related papers (2024-10-23T17:02:59Z)
Can Large Language Models Analyze Graphs like Professionals? A Benchmark, Datasets and Models [88.4320775961431]
We introduce ProGraph, a benchmark for large language models (LLMs) to process graphs. Our findings reveal that the performance of current LLMs is unsatisfactory, with the best model achieving only 36% accuracy. We propose LLM4Graph datasets, which include crawled documents and auto-generated codes based on 6 widely used graph libraries.
arXiv Detail & Related papers (2024-09-29T11:38:45Z)
MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning [60.55556283848063]
Large Language Models' (LLM) reasoning can be improved using test-time aggregation strategies, i.e., generating multiple samples and voting among generated samples. Refinement offers an alternative by using LLM-generated feedback to improve solution quality. We propose MAgICoRe, which avoids excessive refinement by categorizing problem difficulty as easy or hard.
arXiv Detail & Related papers (2024-09-18T17:12:41Z)
DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
Benchmarking GPT-4 on Algorithmic Problems: A Systematic Evaluation of Prompting Strategies [47.129504708849446]
Large Language Models (LLMs) have revolutionized the field of Natural Language Processing. LLMs lack systematic generalization, the ability to extrapolate learned statistical regularities outside the training distribution. In this work, we offer a systematic benchmarking of GPT-4, one of the most advanced LLMs available.
arXiv Detail & Related papers (2024-02-27T10:44:52Z)
TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data [73.29220562541204]
We consider harnessing the amazing power of large language models (LLMs) to solve our task. We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets.
arXiv Detail & Related papers (2024-01-24T04:28:50Z)
GeoLLM: Extracting Geospatial Knowledge from Large Language Models [49.20315582673223]
We present GeoLLM, a novel method that can effectively extract geospatial knowledge from large language models. We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods. Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe.
arXiv Detail & Related papers (2023-10-10T00:03:23Z)
Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis [7.099257763803159]
We evaluate the capabilities of four Large Language Models (LLMs) in addressing several analytical problems with graph data. We employ four distinct evaluation metrics: Comprehension, Correctness, Fidelity, and Rectification. GPT models can generate logical and coherent results, outperforming alternatives in correctness.
arXiv Detail & Related papers (2023-08-22T06:32:07Z)
Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality. We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.