LocationReasoner: Evaluating LLMs on Real-World Site Selection Reasoning
- URL: http://arxiv.org/abs/2506.13841v1
- Date: Mon, 16 Jun 2025 16:23:56 GMT
- Title: LocationReasoner: Evaluating LLMs on Real-World Site Selection Reasoning
- Authors: Miho Koda, Yu Zheng, Ruixian Ma, Mingyang Sun, Devesh Pansare, Fabio Duarte, Paolo Santi,
- Abstract summary: We introduce LocationReasoner, a benchmark designed to evaluate large language models' reasoning abilities in the context of real-world site selection.<n>The benchmark comprises over 300 carefully crafted queries of varying difficulty levels, supported by in-house tools for constraint-based location search.<n>Extensive evaluations reveal that state-of-the-art reasoning models offer limited improvement over their non-reasoning predecessors in real-world contexts.
- Score: 12.265350534588817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large language models (LLMs), particularly those enhanced through reinforced post-training, have demonstrated impressive reasoning capabilities, as exemplified by models such as OpenAI o1 and DeepSeek-R1. However, these capabilities are predominantly benchmarked on domains like mathematical problem solving and code generation -- leaving open the question of whether such reasoning skills generalize to complex, real-world scenarios. In this paper, we introduce LocationReasoner, a benchmark designed to evaluate LLMs' reasoning abilities in the context of real-world site selection, where models must identify feasible locations by reasoning over diverse and complicated spatial, environmental, and logistical constraints. The benchmark comprises over 300 carefully crafted queries of varying difficulty levels, supported by a sandbox environment with in-house tools for constraint-based location search. Extensive evaluations reveal that state-of-the-art reasoning models offer limited improvement over their non-reasoning predecessors in real-world contexts, with even the latest OpenAI o4 model failing on 30% of site selection tasks. Moreover, agentic strategies such as ReAct and Reflexion often suffer from over-reasoning, leading to worse outcomes than direct code-generation prompting. With key limitations of LLMs in holistic and non-linear reasoning highlighted, we release LocationReasoner to foster the development of LLMs and agents capable of robust, grounded reasoning in real-world decision-making tasks. Codes and data for our benchmark are available at https://github.com/miho-koda/LocationReasoner.
Related papers
- Automated Benchmark Generation from Domain Guidelines Informed by Bloom's Taxonomy [28.293009223912602]
Open-ended question answering (QA) evaluates a model's ability to perform contextualized reasoning beyond factual recall.<n>This challenge is especially acute in practice-based domains, where knowledge is procedural and grounded in professional judgment.<n>We introduce a framework for automated benchmark generation from expert-authored guidelines informed by Bloom's taxonomy.
arXiv Detail & Related papers (2026-01-28T05:01:11Z) - From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs [65.04549036809557]
We introduce a benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors.<n>This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions.<n> Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings.
arXiv Detail & Related papers (2025-12-22T18:58:12Z) - Agentic AI Reasoning for Mobile Edge General Intelligence: Fundamentals, Approaches, and Directions [74.35421055079655]
Large language models (LLMs) have enabled an emergence of agentic artificial intelligence (AI) with powerful reasoning and autonomous decision-making capabilities.<n>Mobile Edge General Intelligence (MEGI) brings real-time, privacy-preserving reasoning to the network edge.<n>We propose a joint optimization framework for efficient LLM reasoning deployment in MEGI.
arXiv Detail & Related papers (2025-09-27T10:53:48Z) - seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs [1.0519693622157462]
We introduce seqBench, a benchmark for probing sequential reasoning limits in Large Language Models (LLMs)<n>We find that even top-performing models systematically fail on seqBench's structured reasoning tasks despite minimal search complexity.
arXiv Detail & Related papers (2025-09-21T01:32:13Z) - HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds [0.0]
Large language models (LLMs) have shown remarkable capabilities in isolated step-by-step reasoning tasks such as mathematics and programming.<n>But their proficiency in long-horizon planning, where solutions require extended, structured sequences of interdependent actions, remains underexplored.<n>We introduce HeroBench, a novel benchmark designed specifically to evaluate long-horizon planning and structured reasoning within complex RPG-inspired virtual worlds.
arXiv Detail & Related papers (2025-08-18T09:59:02Z) - Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation [75.26829371493189]
Large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking.<n>Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability.<n>We propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework.
arXiv Detail & Related papers (2025-06-03T09:01:08Z) - DEEPQUESTION: Systematic Generation of Real-World Challenges for Evaluating LLMs Performance [3.9770095824794516]
We introduce DeepQuestion, a scalable automated framework that augments existing datasets based on Bloom's taxonomy.<n>We show substantial performance drops (even up to 70% accuracy loss) on higher-order tasks, underscoring persistent gaps in deep reasoning.
arXiv Detail & Related papers (2025-05-30T12:39:42Z) - General-Reasoner: Advancing LLM Reasoning Across All Domains [64.70599911897595]
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs)<n>We propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains.<n>We train a series of models and evaluate them on a wide range of datasets covering wide domains like physics, chemistry, finance, electronics etc.
arXiv Detail & Related papers (2025-05-20T17:41:33Z) - Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges [4.668749313973097]
This paper systematically evaluate Large Language Models (LLMs) and Large Reasoning Models (LRMs) across three levels of reasoning complexity.<n>We curate 26 challenges where models answer directly or by Python Code Interpreter.<n>LRMs show robust performance across tasks with various levels of difficulty, often competing or surpassing traditional first-principle-based methods.
arXiv Detail & Related papers (2025-05-16T18:32:35Z) - Evaluating Large Language Models for Real-World Engineering Tasks [75.97299249823972]
This paper introduces a curated database comprising over 100 questions derived from authentic, production-oriented engineering scenarios.<n>Using this dataset, we evaluate four state-of-the-art Large Language Models (LLMs)<n>Our results show that LLMs demonstrate strengths in basic temporal and structural reasoning but struggle significantly with abstract reasoning, formal modeling, and context-sensitive engineering logic.
arXiv Detail & Related papers (2025-05-12T14:05:23Z) - LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems [7.379503137362718]
We introduce LR$2$Bench, a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of Large Language Models.<n>Our evaluation reveals that even the most advanced LRMs, such as DeepSeek-R1 and OpenAI o1-preview, struggle with tasks in LR$2$Bench.
arXiv Detail & Related papers (2025-02-25T04:51:17Z) - Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.<n>However, they still struggle with problems requiring multi-step decision-making and environmental feedback.<n>We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z) - Exploring Language Model Generalization in Low-Resource Extractive QA [57.14068405860034]
We investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift.<n>We devise a series of experiments to explain the performance gap empirically.
arXiv Detail & Related papers (2024-09-27T05:06:43Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - DetermLR: Augmenting LLM-based Logical Reasoning from Indeterminacy to Determinacy [76.58614128865652]
We propose DetermLR, a novel perspective that rethinks the reasoning process as an evolution from indeterminacy to determinacy.
First, we categorize known conditions into two types: determinate and indeterminate premises This provides an oveall direction for the reasoning process and guides LLMs in converting indeterminate data into progressively determinate insights.
We automate the storage and extraction of available premises and reasoning paths with reasoning memory, preserving historical reasoning details for subsequent reasoning steps.
arXiv Detail & Related papers (2023-10-28T10:05:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.