Quantifying Geospatial in the Common Crawl Corpus
- URL: http://arxiv.org/abs/2406.04952v1
- Date: Fri, 7 Jun 2024 14:16:37 GMT
- Title: Quantifying Geospatial in the Common Crawl Corpus
- Authors: Ilya Ilyankou, Meihui Wang, James Haworth, Stefano Cavazzi,
- Abstract summary: This paper investigates the prevalence of geospatial data in Common Crawl releases using Gemini, a powerful language model.
We estimate that between 1 in 5 and 1 in 6 documents contain geospatial information such as coordinates and street addresses.
- Score: 0.07499722271664144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) exhibit emerging geospatial capabilities, stemming from their pre-training on vast unlabelled text datasets that are often derived from the Common Crawl corpus. However, the geospatial content within CC remains largely unexplored, impacting our understanding of LLMs' spatial reasoning. This paper investigates the prevalence of geospatial data in recent Common Crawl releases using Gemini, a powerful language model. By analyzing a sample of documents and manually revising the results, we estimate that between 1 in 5 and 1 in 6 documents contain geospatial information such as coordinates and street addresses. Our findings provide quantitative insights into the nature and extent of geospatial data within Common Crawl, and web crawl data in general. Furthermore, we formulate questions to guide future investigations into the geospatial content of available web crawl datasets and its influence on LLMs.
Related papers
- Query of CC: Unearthing Large Scale Domain-Specific Knowledge from
Public Corpora [104.16648246740543]
We propose an efficient data collection method based on large language models.
The method bootstraps seed information through a large language model and retrieves related data from public corpora.
It not only collects knowledge-related data for specific domains but unearths the data with potential reasoning procedures.
arXiv Detail & Related papers (2024-01-26T03:38:23Z) - GeoGalactica: A Scientific Large Language Model in Geoscience [95.15911521220052]
Large language models (LLMs) have achieved huge success for their general knowledge and ability to solve a wide spectrum of tasks in natural language processing (NLP)
We specialize an LLM into geoscience, by further pre-training the model with a vast amount of texts in geoscience, as well as supervised fine-tuning (SFT) the resulting model with our custom collected instruction tuning dataset.
We train GeoGalactica over a geoscience-related text corpus containing 65 billion tokens, preserving as the largest geoscience-specific text corpus.
Then we fine-tune the model with 1 million pairs of instruction-tuning
arXiv Detail & Related papers (2023-12-31T09:22:54Z) - GeoLLM: Extracting Geospatial Knowledge from Large Language Models [49.20315582673223]
We present GeoLLM, a novel method that can effectively extract geospatial knowledge from large language models.
We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods.
Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe.
arXiv Detail & Related papers (2023-10-10T00:03:23Z) - Are Large Language Models Geospatially Knowledgeable? [21.401931052512595]
This paper investigates the extent of geospatial knowledge, awareness, and reasoning abilities encoded within Large Language Models (LLM)
With a focus on autoregressive language models, we devise experimental approaches related to (i) probing LLMs for geo-coordinates to assess geospatial knowledge, (ii) using geospatial and non-geospatial prepositions to gauge their geospatial awareness, and (iii) utilizing a multidimensional scaling (MDS) experiment to assess the models' geospatial reasoning capabilities.
arXiv Detail & Related papers (2023-10-09T17:20:11Z) - K2: A Foundation Language Model for Geoscience Knowledge Understanding
and Utilization [105.89544876731942]
Large language models (LLMs) have achieved great success in general domains of natural language processing.
We present the first-ever LLM in geoscience, K2, alongside a suite of resources developed to further promote LLM research within geoscience.
arXiv Detail & Related papers (2023-06-08T09:29:05Z) - GeoGLUE: A GeoGraphic Language Understanding Evaluation Benchmark [56.08664336835741]
We propose a GeoGraphic Language Understanding Evaluation benchmark, named GeoGLUE.
We collect data from open-released geographic resources and introduce six natural language understanding tasks.
We pro vide evaluation experiments and analysis of general baselines, indicating the effectiveness and significance of the GeoGLUE benchmark.
arXiv Detail & Related papers (2023-05-11T03:21:56Z) - MGeo: Multi-Modal Geographic Pre-Training Method [49.78466122982627]
We propose a novel query-POI matching method Multi-modal Geographic language model (MGeo)
MGeo represents GC as a new modality and is able to fully extract multi-modal correlations for accurate query-POI matching.
Our proposed multi-modal pre-training method can significantly improve the query-POI matching capability of generic PTMs.
arXiv Detail & Related papers (2023-01-11T03:05:12Z) - GeoDE: a Geographically Diverse Evaluation Dataset for Object
Recognition [31.194474203667042]
GeoDE is a geographically diverse dataset with 61,940 images from 40 classes and 6 world regions.
We release the full dataset and code at https://geodiverse-data-collection.cs.princeton.edu/.
arXiv Detail & Related papers (2023-01-05T18:21:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.