Quantifying Geospatial in the Common Crawl Corpus
- URL: http://arxiv.org/abs/2406.04952v2
- Date: Thu, 29 Aug 2024 16:49:29 GMT
- Title: Quantifying Geospatial in the Common Crawl Corpus
- Authors: Ilya Ilyankou, Meihui Wang, Stefano Cavazzi, James Haworth
- Abstract summary: This paper investigates the prevalence of geospatial data in recent Common Crawl releases using Gemini 1.5, a powerful language model.
We estimate that 18.7% of web documents in CC contain geospatial information such as coordinates and addresses.
- Score: 0.07499722271664144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) exhibit emerging geospatial capabilities, stemming from their pre-training on vast unlabelled text datasets that are often derived from the Common Crawl (CC) corpus. However, the geospatial content within CC remains largely unexplored, impacting our understanding of LLMs' spatial reasoning. This paper investigates the prevalence of geospatial data in recent Common Crawl releases using Gemini 1.5, a powerful language model. By analyzing a sample of documents and manually revising the results, we estimate that 18.7% of web documents in CC contain geospatial information such as coordinates and addresses. We find little difference in prevalence between English- and non-English-language documents. Our findings provide quantitative insights into the nature and extent of geospatial data in CC, and lay the groundwork for future studies of geospatial biases of LLMs.
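A sample-based prevalence figure like the 18.7% above carries sampling uncertainty. The following is a minimal sketch of how such an estimate and its confidence interval could be computed, not the authors' procedure. The sample size and hit count are hypothetical placeholders chosen only to reproduce a 18.7% point estimate; the abstract does not state the actual counts here, and the Wilson score interval is an assumed choice of method.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion.

    hits: number of sampled documents labelled as containing geospatial data
    n:    total number of sampled documents
    z:    normal quantile (1.96 corresponds to roughly 95% coverage)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= hits <= n:
        raise ValueError("hits must lie between 0 and n")
    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts, NOT taken from the paper.
n_sampled = 1000
n_geo = 187

point_estimate = n_geo / n_sampled
low, high = wilson_interval(n_geo, n_sampled)
print(f"prevalence = {point_estimate:.1%}, 95% interval = [{low:.1%}, {high:.1%}]")
```

With a larger sample the interval narrows, which is the main lever for tightening an estimate of this kind.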
Related papers
- OmniGeo: Towards a Multimodal Large Language Models for Geospatial Artificial Intelligence [51.0456395687016]
Multimodal large language models (MLLMs) have opened new frontiers in artificial intelligence.
We propose an MLLM (OmniGeo) tailored to geospatial applications.
By combining the strengths of natural language understanding and spatial reasoning, our model enhances the ability of instruction following and the accuracy of GeoAI systems.
arXiv Detail & Related papers (2025-03-20T16:45:48Z) - Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework [59.42946541163632]
We introduce a comprehensive geolocation framework with three key components:
GeoComp, a large-scale dataset; GeoCoT, a novel reasoning method; and GeoEval, an evaluation metric.
We demonstrate that GeoCoT significantly boosts geolocation accuracy by up to 25% while enhancing interpretability.
arXiv Detail & Related papers (2025-02-19T14:21:25Z) - Where on Earth Do Users Say They Are?: Geo-Entity Linking for Noisy Multilingual User Input [2.516307239032451]
We present a method which represents real-world locations as averaged embeddings from labeled user-input location names.
We show that our approach improves geo-entity linking on a global and multilingual social media dataset.
arXiv Detail & Related papers (2024-04-29T15:18:33Z) - GeoGalactica: A Scientific Large Language Model in Geoscience [95.15911521220052]
Large language models (LLMs) have achieved huge success for their general knowledge and ability to solve a wide spectrum of tasks in natural language processing (NLP).
We specialize an LLM into geoscience, by further pre-training the model with a vast amount of texts in geoscience, as well as supervised fine-tuning (SFT) the resulting model with our custom collected instruction tuning dataset.
We train GeoGalactica over a geoscience-related text corpus containing 65 billion tokens, which serves as the largest geoscience-specific text corpus.
Then we fine-tune the model with 1 million pairs of instruction-tuning
arXiv Detail & Related papers (2023-12-31T09:22:54Z) - GeoLM: Empowering Language Models for Geospatially Grounded Language Understanding [45.36562604939258]
This paper introduces GeoLM, a language model that enhances the understanding of geo-entities in natural language.
We demonstrate that GeoLM exhibits promising capabilities in supporting toponym recognition, toponym linking, relation extraction, and geo-entity typing.
arXiv Detail & Related papers (2023-10-23T01:20:01Z) - GeoLLM: Extracting Geospatial Knowledge from Large Language Models [49.20315582673223]
We present GeoLLM, a novel method that can effectively extract geospatial knowledge from large language models.
We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods.
Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe.
arXiv Detail & Related papers (2023-10-10T00:03:23Z) - Are Large Language Models Geospatially Knowledgeable? [21.401931052512595]
This paper investigates the extent of geospatial knowledge, awareness, and reasoning abilities encoded within Large Language Models (LLMs).
With a focus on autoregressive language models, we devise experimental approaches related to (i) probing LLMs for geo-coordinates to assess geospatial knowledge, (ii) using geospatial and non-geospatial prepositions to gauge their geospatial awareness, and (iii) utilizing a multidimensional scaling (MDS) experiment to assess the models' geospatial reasoning capabilities.
arXiv Detail & Related papers (2023-10-09T17:20:11Z) - Geo-Encoder: A Chunk-Argument Bi-Encoder Framework for Chinese Geographic Re-Ranking [61.60169764507917]
The Chinese geographic re-ranking task aims to find the most relevant addresses among retrieved candidates.
We propose an innovative framework, namely Geo-Encoder, to more effectively integrate Chinese geographical semantics into re-ranking pipelines.
arXiv Detail & Related papers (2023-09-04T13:44:50Z) - K2: A Foundation Language Model for Geoscience Knowledge Understanding and Utilization [105.89544876731942]
Large language models (LLMs) have achieved great success in general domains of natural language processing.
We present the first-ever LLM in geoscience, K2, alongside a suite of resources developed to further promote LLM research within geoscience.
arXiv Detail & Related papers (2023-06-08T09:29:05Z) - GeoGLUE: A GeoGraphic Language Understanding Evaluation Benchmark [56.08664336835741]
We propose a GeoGraphic Language Understanding Evaluation benchmark, named GeoGLUE.
We collect data from open-released geographic resources and introduce six natural language understanding tasks.
We provide evaluation experiments and analysis of general baselines, indicating the effectiveness and significance of the GeoGLUE benchmark.
arXiv Detail & Related papers (2023-05-11T03:21:56Z) - MGeo: Multi-Modal Geographic Pre-Training Method [49.78466122982627]
We propose a novel query-POI matching method, the Multi-modal Geographic language model (MGeo).
MGeo represents geographic context (GC) as a new modality and is able to fully extract multi-modal correlations for accurate query-POI matching.
Our proposed multi-modal pre-training method can significantly improve the query-POI matching capability of generic PTMs.
arXiv Detail & Related papers (2023-01-11T03:05:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.