LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning
- URL: http://arxiv.org/abs/2511.10459v2
- Date: Mon, 17 Nov 2025 19:46:07 GMT
- Title: LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning
- Authors: Zihan Gao, Yifei Xu, Jacob Thebault-Spieker
- Abstract summary: Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, but their ability to handle hyper-local knowledge remains poorly understood. We present LocalBench, the first benchmark designed to evaluate LLMs on county-level local knowledge across the United States. Using LocalBench, we evaluate 13 state-of-the-art LLMs under both closed-book and web-augmented settings.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, such as global factual recall, event summarization, and regional reasoning. Yet, their ability to handle hyper-local knowledge remains poorly understood. This gap is increasingly consequential as real-world applications, from civic platforms to community journalism, demand AI systems that can reason about neighborhood-specific dynamics, cultural narratives, and local governance. Existing benchmarks fall short in capturing this complexity, often relying on coarse-grained data or isolated references. We present LocalBench, the first benchmark designed to systematically evaluate LLMs on county-level local knowledge across the United States. Grounded in the Localness Conceptual Framework, LocalBench includes 14,782 validated question-answer pairs across 526 U.S. counties in 49 states, integrating diverse sources such as Census statistics, local subreddit discourse, and regional news. It spans physical, cognitive, and relational dimensions of locality. Using LocalBench, we evaluate 13 state-of-the-art LLMs under both closed-book and web-augmented settings. Our findings reveal critical limitations: even the best-performing models reach only 56.8% accuracy on narrative-style questions and perform below 15.5% on numerical reasoning. Moreover, larger model size and web augmentation do not guarantee better performance; for example, search improves Gemini's accuracy by 13.6% but reduces GPT-series performance by 11.4%. These results underscore the urgent need for language models that can support equitable, place-aware AI systems: capable of engaging with the diverse, fine-grained realities of local communities across geographic and cultural contexts.
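The headline numbers in the abstract are accuracies over validated question-answer pairs, reported per question type. As a minimal sketch of how such per-category accuracy could be computed, assuming hypothetical record fields (`category`, `gold`, `prediction`; the benchmark's actual scoring protocol is not specified here):

```python
from collections import defaultdict

def accuracy_by_category(records):
    """Compute per-category exact-match accuracy over QA records.

    Each record is a dict with hypothetical keys:
    'category' (e.g. 'narrative' or 'numerical'),
    'gold' (reference answer), and 'prediction' (model answer).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        # Case- and whitespace-insensitive exact match.
        if r["prediction"].strip().lower() == r["gold"].strip().lower():
            correct[r["category"]] += 1
    return {c: correct[c] / total[c] for c in total}

# Toy illustration with made-up records.
records = [
    {"category": "narrative", "gold": "Dane County", "prediction": "dane county"},
    {"category": "narrative", "gold": "Madison", "prediction": "Milwaukee"},
    {"category": "numerical", "gold": "561504", "prediction": "561504"},
]
print(accuracy_by_category(records))
```

Real evaluation harnesses typically use more forgiving matching for numeric answers (tolerances, unit normalization); exact match is shown only to make the accuracy definition concrete.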
Related papers
- Metadata Conditioned Large Language Models for Localization [25.913929585741034]
We show that metadata conditioning consistently improves in-region performance without sacrificing cross-region generalization. Our ablation studies demonstrate that URL-level metadata alone captures much of the geographic signal. After instruction tuning, metadata conditioned global models achieve accuracy comparable to LLaMA-3.2-1B-Instruct, despite being trained on substantially less data.
arXiv Detail & Related papers (2026-01-21T18:20:59Z)
- Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales [61.03549470159347]
Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions has not been comprehensively evaluated. We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use.
arXiv Detail & Related papers (2025-10-13T01:12:21Z)
- Towards Explainable Bilingual Multimodal Misinformation Detection and Localization [64.37162720126194]
BiMi is a framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis. BiMiBench is a benchmark constructed by systematically editing real news images and subtitles. BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore.
arXiv Detail & Related papers (2025-06-28T15:43:06Z)
- NativQA Framework: Enabling LLMs with Native, Local, and Everyday Knowledge [11.430887334254422]
We propose the NativQA framework, which can seamlessly construct large-scale, culturally and regionally aligned QA datasets in native languages. The framework has been evaluated across 39 locations in 24 countries and in 7 languages, resulting in over 300K question-answer pairs.
arXiv Detail & Related papers (2025-04-08T13:01:51Z)
- Understanding Inequality of LLM Fact-Checking over Geographic Regions with Agent and Retrieval models [7.604241782666465]
We evaluate the factual accuracy of open and private models across a diverse set of regions and scenarios. Our findings reveal that regardless of the scenario and LLM used, statements from the Global North perform substantially better than those from the Global South.
arXiv Detail & Related papers (2025-03-28T21:07:43Z)
- Beyond the Surface: Uncovering Implicit Locations with LLMs for Personalized Local News [0.2749898166276854]
This paper explores Large Language Models (LLMs) for local article classification in Taboola's "Homepage For You" system. LLMs offer new possibilities while raising concerns about accuracy and explainability. A scalable pipeline integrating LLM-based location classification boosted local article distribution by 27%.
arXiv Detail & Related papers (2025-02-20T15:55:52Z)
- TiEBe: Tracking Language Model Recall of Notable Worldwide Events Through Time [9.745912505259312]
We present TiEBe, a dataset of over 23,000 question-answer pairs centered on notable global and regional events. These events are then used to construct a benchmark to evaluate LLMs' understanding of global and regional developments. Our results reveal significant geographic disparities in factual recall, emphasizing the need for more balanced global representation.
arXiv Detail & Related papers (2025-01-13T16:58:32Z)
- Distortions in Judged Spatial Relations in Large Language Models [45.875801135769585]
GPT-4 exhibited superior performance with 55 percent accuracy, followed by GPT-3.5 at 47 percent, and Llama-2 at 45 percent.
The models identified the nearest cardinal direction in most cases, reflecting their associative learning mechanism.
arXiv Detail & Related papers (2024-01-08T20:08:04Z)
- Recognize Any Regions [55.76437190434433]
RegionSpot integrates position-aware localization knowledge from a localization foundation model with semantic information from a ViL model. Experiments in open-world object recognition show that RegionSpot achieves significant performance gains over prior alternatives.
arXiv Detail & Related papers (2023-11-02T16:31:49Z)
- GeoLLM: Extracting Geospatial Knowledge from Large Language Models [49.20315582673223]
We present GeoLLM, a novel method that can effectively extract geospatial knowledge from large language models.
We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods.
Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe.
arXiv Detail & Related papers (2023-10-10T00:03:23Z)
- Jalisco's multiclass land cover analysis and classification using a novel lightweight convnet with real-world multispectral and relief data [51.715517570634994]
We present our novel lightweight (only 89k parameters) Convolutional Neural Network (ConvNet) for land cover (LC) classification and analysis.
In this work, we combine three real-world open data sources to obtain 13 channels.
Our embedded analysis anticipates the limited performance in some classes and gives us the opportunity to group the most similar ones.
arXiv Detail & Related papers (2022-01-26T14:58:51Z)
- Enhancing Prototypical Few-Shot Learning by Leveraging the Local-Level Strategy [75.63022284445945]
We find that existing works often build their few-shot models on image-level features obtained by mixing all local-level features.
We present (a) a local-agnostic training strategy to avoid the discriminative location bias between the base and novel categories, and (b) a novel local-level similarity measure to capture the accurate comparison between local-level features.
arXiv Detail & Related papers (2021-11-08T08:45:15Z)
- Capturing Structural Locality in Non-parametric Language Models [85.94669097485992]
We propose a simple yet effective approach for adding locality information into non-parametric language models.
Experiments on two different domains, Java source code and Wikipedia text, demonstrate that locality features improve model efficacy.
arXiv Detail & Related papers (2021-10-06T15:53:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.