Related papers: LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?

LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?

URL: http://arxiv.org/abs/2602.16902v3
Date: Mon, 23 Feb 2026 11:03:50 GMT
Title: LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?
Authors: Juliusz Ziomek, William Bankes, Lorenz Wolf, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic,
Abstract summary: LLM-Wikirace is a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs)<n>We evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5.<n>Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point, planning and long-horizon reasoning capabilities become the dominant factors.
Score: 15.465732997309182
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wikirace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, requiring look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, which achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23\% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point, beyond this threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level analysis further reveals that even the strongest models struggle to replan after failure, frequently entering loops rather than recovering. LLM-Wikirace is a simple benchmark that reveals clear limitations in current reasoning systems, offering an open arena where planning-capable LLMs still have much to prove. Our code and leaderboard available at https:/llmwikirace.github.io.

Related papers

Mixing Expert Knowledge: Bring Human Thoughts Back To the Game of Go [74.28228642327726]
Large language models (LLMs) have demonstrated exceptional performance in reasoning tasks such as mathematics and coding.<n>LoGos is a powerful LLM that not only maintains outstanding general reasoning abilities, but also conducts Go gameplay in natural language.<n>LoGos achieves performance comparable to human professional players, substantially surpassing all existing LLMs.
arXiv Detail & Related papers (2026-01-23T05:00:49Z)
Do You Get the Hint? Benchmarking LLMs on the Board Game Concept [1.671764884922859]
Large language models (LLMs) have achieved striking successes on many benchmarks, yet recent studies continue to expose fundamental weaknesses.<n>In this paper, we introduce Concept, a simple word-guessing board game, as a benchmark for probing abductive reasoning in a representation that is much closer to natural language data.<n>Our results show that this game, easily solved by humans (with a success rate of over 90%), is still very challenging for state-of-the-art LLMs.
arXiv Detail & Related papers (2025-10-15T08:17:25Z)
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? [88.29001498765629]
Large language models (LLMs) now outperform elite humans in competitive programming.<n>We revisit this claim, examining how LLMs differ from human experts and where limitations still remain.<n>We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI.<n>A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions.
arXiv Detail & Related papers (2025-06-13T16:29:09Z)
GAMEBoT: Transparent Assessment of LLM Reasoning in Games [54.49589494014147]
GAMEBoT is a gaming arena designed for rigorous assessment of Large Language Models.<n>We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics.<n>Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts.
arXiv Detail & Related papers (2024-12-18T08:32:53Z)
Are Your LLMs Capable of Stable Reasoning? [38.03049704515947]
We introduce G-Pass@$k$, a novel evaluation metric that continuously assesses model performance across multiple sampling attempts.<n>We employ G-Pass@$k$ in conjunction with state-of-the-art large language models to provide comprehensive insights into their potential capabilities and operational consistency.
arXiv Detail & Related papers (2024-12-17T18:12:47Z)
GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps [5.874552372073687]
Large language models (LLMs) have recently demonstrated great success in generating and understanding natural language. We propose GameTraversalBenchmark (GTB), a benchmark consisting of diverse 2D grid-based game maps. GPT-4-Turbo achieved the highest score of 44.97% on GTB_Score (GTBS), a composite score that combines the three above criteria.
arXiv Detail & Related papers (2024-10-10T09:54:28Z)
Revisiting the Graph Reasoning Ability of Large Language Models: Case Studies in Translation, Connectivity and Shortest Path [53.71787069694794]
We focus on the graph reasoning ability of Large Language Models (LLMs)<n>We revisit the ability of LLMs on three fundamental graph tasks: graph description translation, graph connectivity, and the shortest-path problem.<n>Our findings suggest that LLMs can fail to understand graph structures through text descriptions and exhibit varying performance for all these fundamental tasks.
arXiv Detail & Related papers (2024-08-18T16:26:39Z)
WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia [59.96425443250666]
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs) In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions based on contradictory passages from Wikipedia. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages.
arXiv Detail & Related papers (2024-06-19T20:13:42Z)
When LLMs Meet Cunning Texts: A Fallacy Understanding Benchmark for Large Language Models [59.84769254832941]
We propose a FaLlacy Understanding Benchmark (FLUB) containing cunning texts that are easy for humans to understand but difficult for models to grasp. Specifically, the cunning texts that FLUB focuses on mainly consist of the tricky, humorous, and misleading texts collected from the real internet environment. Based on FLUB, we investigate the performance of multiple representative and advanced LLMs.
arXiv Detail & Related papers (2024-02-16T22:12:53Z)
Graph-enhanced Large Language Models in Asynchronous Plan Reasoning [18.402877904882107]
We find that large language models (LLMs) behave poorly when not supplied with illustrations about the task-solving process in our benchmark AsyncHow. We propose a novel technique called Plan Like a Graph (PLaG) that combines graphs with natural language prompts and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-02-05T08:26:33Z)
Plan, Eliminate, and Track -- Language Models are Good Teachers for Embodied Agents [99.17668730578586]
Pre-trained large language models (LLMs) capture procedural knowledge about the world. Plan, Eliminate, and Track (PET) framework translates a task description into a list of high-level sub-tasks. PET framework leads to a significant 15% improvement over SOTA for generalization to human goal specifications.
arXiv Detail & Related papers (2023-05-03T20:11:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.