Distortions in Judged Spatial Relations in Large Language Models
- URL: http://arxiv.org/abs/2401.04218v2
- Date: Tue, 4 Jun 2024 12:01:56 GMT
- Title: Distortions in Judged Spatial Relations in Large Language Models
- Authors: Nir Fulman, Abdulkadir Memduhoğlu, Alexander Zipf
- Abstract summary: GPT-4 exhibited superior performance with 55 percent accuracy, followed by GPT-3.5 at 47 percent, and Llama-2 at 45 percent.
The models identified the nearest cardinal direction in most cases, reflecting their associative learning mechanism.
- Score: 45.875801135769585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a benchmark for assessing the capability of Large Language Models (LLMs) to discern intercardinal directions between geographic locations and apply it to three prominent LLMs: GPT-3.5, GPT-4, and Llama-2. This benchmark specifically evaluates whether LLMs exhibit a hierarchical spatial bias similar to humans, where judgments about individual locations' spatial relationships are influenced by the perceived relationships of the larger groups that contain them. To investigate this, we formulated 14 questions focusing on well-known American cities. Seven questions were designed to challenge the LLMs with scenarios potentially influenced by the orientation of larger geographical units, such as states or countries, while the remaining seven targeted locations less susceptible to such hierarchical categorization. Among the tested models, GPT-4 exhibited superior performance with 55 percent accuracy, followed by GPT-3.5 at 47 percent and Llama-2 at 45 percent. The models showed significantly reduced accuracy on tasks with suspected hierarchical bias. For example, GPT-4's accuracy dropped to 33 percent on these tasks, compared to 86 percent on others. However, the models identified the nearest cardinal direction in most cases, reflecting their associative learning mechanism, thereby embodying human-like misconceptions. We discuss avenues for improving the spatial reasoning capabilities of LLMs.
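The evaluation protocol described in the abstract (ground-truth intercardinal directions between city pairs, compared against model answers) can be pictured with a small scoring sketch. The snippet below is illustrative only and is not the authors' released code; the city coordinates, the eight-sector mapping, and the answer-matching heuristic are assumptions made for this example.

```python
import math

# Illustrative coordinates (lat, lon); not taken from the paper's benchmark.
CITIES = {
    "Seattle": (47.61, -122.33),
    "Boston": (42.36, -71.06),
}

DIRECTIONS = ["north", "northeast", "east", "southeast",
              "south", "southwest", "west", "northwest"]

def initial_bearing(origin, target):
    """Forward azimuth from origin to target, in degrees clockwise from north."""
    lat1, lon1 = map(math.radians, origin)
    lat2, lon2 = map(math.radians, target)
    dlon = lon2 - lon1
    x = math.sin(dlon) * math.cos(lat2)
    y = math.cos(lat1) * math.sin(lat2) - math.sin(lat1) * math.cos(lat2) * math.cos(dlon)
    return math.degrees(math.atan2(x, y)) % 360

def bearing_to_direction(bearing):
    """Map a bearing to the nearest of the eight (inter)cardinal directions."""
    return DIRECTIONS[round(bearing / 45) % 8]

def extract_direction(answer):
    """Pick the most specific direction word mentioned in the model's answer."""
    answer = answer.lower()
    for d in sorted(DIRECTIONS, key=len, reverse=True):  # check "northeast" before "north"
        if d in answer:
            return d
    return None

def score_answer(origin, target, llm_answer):
    """Return True if the model's free-text answer names the correct direction."""
    truth = bearing_to_direction(initial_bearing(CITIES[origin], CITIES[target]))
    return extract_direction(llm_answer) == truth

# Example: "In which direction is Boston from Seattle?" -> ground truth is roughly east.
print(score_answer("Seattle", "Boston", "Boston lies roughly east of Seattle."))
```

A full evaluation along these lines would replace the hard-coded answer string with responses from the model under test and iterate over the benchmark's 14 question pairs.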
Related papers
- Hate Personified: Investigating the role of LLMs in content moderation [64.26243779985393]
For subjective tasks such as hate detection, where people perceive hate differently, the ability of Large Language Models (LLMs) to represent diverse groups is unclear.
By including additional context in prompts, we analyze LLM sensitivity to geographical priming, persona attributes, and numerical information to assess how well the needs of various groups are reflected.
arXiv Detail & Related papers (2024-10-03T16:43:17Z)
- STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions [6.19084217044276]
Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing.
We introduce the Sensitivity Testing on Offensive Progressions dataset, which includes 450 offensive progressions containing 2,700 unique sentences.
Our findings reveal that even the best-performing models detect bias inconsistently, with success rates ranging from 19.3% to 69.8%.
arXiv Detail & Related papers (2024-09-20T18:34:38Z)
- Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models [61.899791071654654]
We introduce a benchmark, Q-Spatial Bench, with 271 questions across five categories designed for quantitative spatial reasoning.
We investigate the performance of state-of-the-art vision-language models (VLMs) on this task.
We develop a zero-shot prompting technique, SpatialPrompt, that encourages VLMs to answer quantitative spatial questions using reference objects as visual cues.
arXiv Detail & Related papers (2024-09-15T16:45:42Z)
- Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games [56.70628673595041]
Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored.
This work investigates the performance and merits of LLMs in canonical game-theoretic two-player non-zero-sum games, Stag Hunt and Prisoner's Dilemma.
Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that these models, when making decisions in these games, are affected by at least one systematic bias.
arXiv Detail & Related papers (2024-07-05T12:30:02Z)
- Large Language Models are Zero-Shot Next Location Predictors [4.315451628809687]
Large Language Models (LLMs) have shown good generalization and reasoning capabilities.
LLMs can obtain accuracies up to 36.2%, a significant improvement of almost 640% when compared to other models specifically designed for human mobility.
arXiv Detail & Related papers (2024-05-31T16:07:33Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
- LM4OPT: Unveiling the Potential of Large Language Models in Formulating Mathematical Optimization Problems [0.0]
This study compares prominent Large Language Models, including GPT-3.5, GPT-4, and Llama-2-7b, in zero-shot and one-shot settings.
Our findings show GPT-4's superior performance, particularly in the one-shot scenario.
arXiv Detail & Related papers (2024-03-02T23:32:33Z)
- Split and Merge: Aligning Position Biases in LLM-based Evaluators [22.265542509143756]
PORTIA is an alignment-based system designed to mimic human comparison strategies to calibrate position bias.
Our results show that PORTIA markedly enhances the consistency rates for all the models and comparison forms tested.
It rectifies around 80% of the position bias instances within the GPT-4 model, elevating its consistency rate up to 98%.
arXiv Detail & Related papers (2023-09-29T14:38:58Z)
- Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model [0.0]
Large language models (LLMs) have made significant advancements in natural language processing (NLP).
Training LLMs on focused corpora poses computational challenges.
An alternative approach is to use a retrieval-augmentation (RetA) method tested in a specific domain; a generic sketch of this pattern appears after this list.
OpenAI's GPT-3, GPT-4, Bing's Prometheus, and a custom RetA model were compared using 19 questions on diffuse large B-cell lymphoma (DLBCL) disease.
arXiv Detail & Related papers (2023-05-26T17:33:05Z)
- Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations [58.442103936918805]
We show that Attention Mask Consistency produces superior visual grounding results compared to previous methods.
AMC is effective, easy to implement, and is general as it can be adopted by any vision-language model.
arXiv Detail & Related papers (2022-06-30T17:55:12Z)
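The retrieval-augmentation (RetA) entry above describes grounding GPT-style models in a focused corpus. The following is a minimal, generic sketch of that pattern, assuming a toy in-memory corpus and simple keyword-overlap retrieval; it is not the custom RetA model evaluated in that paper, and the passages and `build_prompt` helper are placeholders invented for illustration.

```python
# Minimal retrieval-augmentation sketch: retrieve the most relevant passages
# by keyword overlap, then prepend them to the question for the language model.
# The corpus below is a placeholder, not the DLBCL material used in the paper.
CORPUS = [
    "Diffuse large B-cell lymphoma (DLBCL) is an aggressive non-Hodgkin lymphoma.",
    "R-CHOP is a common first-line treatment regimen for DLBCL.",
    "Large language models can hallucinate facts when asked domain questions.",
]

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by the number of lowercase tokens shared with the question."""
    q_tokens = set(question.lower().split())
    scored = sorted(corpus,
                    key=lambda p: len(q_tokens & set(p.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(question: str, corpus: list[str]) -> str:
    """Assemble a grounded prompt: retrieved context first, then the question."""
    context = "\n".join(retrieve(question, corpus))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

print(build_prompt("What is the first-line treatment for DLBCL?", CORPUS))
```

In practice the keyword overlap would typically be replaced by embedding-based similarity search and the assembled prompt passed to the model of choice; the structure of the pipeline stays the same.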
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.