CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl
- URL: http://arxiv.org/abs/2405.11039v2
- Date: Wed, 29 May 2024 09:16:28 GMT
- Title: CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl
- Authors: Ilya Ilyankou, Meihui Wang, James Haworth, Stefano Cavazzi
- Abstract summary: The Common Crawl (CC) corpus is the largest open web crawl dataset containing 9.5+ petabytes of data captured since 2008.
In this paper, we introduce an efficient pipeline to extract annotated user-generated tracks from GPX files found in CC.
The resulting multimodal dataset includes 1,416 pairings of human-written descriptions and MultiLineString vector data from the 6 most recent CC releases.
- Score: 0.07499722271664144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Common Crawl (CC) corpus is the largest open web crawl dataset containing 9.5+ petabytes of data captured since 2008. The dataset is instrumental in training large language models, and as such it has been studied for (un)desirable content, and distilled for smaller, domain-specific datasets. However, to our knowledge, no research has been dedicated to using CC as a source of annotated geospatial data. In this paper, we introduce an efficient pipeline to extract annotated user-generated tracks from GPX files found in CC, and the resulting multimodal dataset with 1,416 pairings of human-written descriptions and MultiLineString vector data from the 6 most recent CC releases. The dataset can be used to study people's outdoor activity patterns, the way people talk about their outdoor experiences, and for developing trajectory generation or track annotation models. Our reproducible code is available on GitHub: https://github.com/ilyankou/cc-gpx
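The pairing described in the abstract (a human-written description alongside MultiLineString vector data) can be illustrated with a minimal sketch. This is not the authors' pipeline, which is in the linked repository. It assumes a GPX 1.1 document, assumes the description sits in each track's `<desc>` element, and uses a hypothetical helper name, `gpx_to_pairs`, and a made-up sample track. Each `<trkseg>` becomes one line of a GeoJSON-style MultiLineString with coordinates in (lon, lat) order.

```python
import xml.etree.ElementTree as ET

# GPX 1.1 namespace (assumption: input files follow GPX 1.1).
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

# Made-up sample document, for illustration only.
SAMPLE_GPX = """
<gpx version="1.1" xmlns="http://www.topografix.com/GPX/1/1">
  <trk>
    <name>Ridge walk</name>
    <desc>A short walk along the ridge with two stops.</desc>
    <trkseg>
      <trkpt lat="51.50" lon="-0.12"/>
      <trkpt lat="51.51" lon="-0.13"/>
    </trkseg>
    <trkseg>
      <trkpt lat="51.52" lon="-0.14"/>
      <trkpt lat="51.53" lon="-0.15"/>
    </trkseg>
  </trk>
  <trk>
    <name>Undescribed track</name>
    <trkseg>
      <trkpt lat="48.85" lon="2.35"/>
      <trkpt lat="48.86" lon="2.36"/>
    </trkseg>
  </trk>
</gpx>
"""


def gpx_to_pairs(gpx_text):
    """Hypothetical helper: return (description, MultiLineString) pairs.

    One pair per <trk> that has a non-empty <desc> and at least one
    segment with two or more points. Tracks without a description are
    skipped.
    """
    root = ET.fromstring(gpx_text)
    pairs = []
    for trk in root.findall("gpx:trk", GPX_NS):
        desc = trk.findtext("gpx:desc", default="", namespaces=GPX_NS).strip()
        lines = []
        for seg in trk.findall("gpx:trkseg", GPX_NS):
            # GeoJSON convention: longitude first, then latitude.
            coords = [
                [float(pt.get("lon")), float(pt.get("lat"))]
                for pt in seg.findall("gpx:trkpt", GPX_NS)
            ]
            if len(coords) >= 2:
                lines.append(coords)
        if desc and lines:
            geometry = {"type": "MultiLineString", "coordinates": lines}
            pairs.append((desc, geometry))
    return pairs


pairs = gpx_to_pairs(SAMPLE_GPX)
for description, geometry in pairs:
    print(description, len(geometry["coordinates"]), "segment(s)")
```

Keeping each segment as a separate line, rather than concatenating all points, preserves gaps in a recording, which is what a MultiLineString is suited to represent.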
Related papers
- Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions [53.069446715005924]
Graph-based captioning (GBC) describes an image using a labelled graph structure.
Nodes in GBC are created, in a first stage, using object detection and dense captioning tools.
We show that using GBC nodes' annotations results in a significant performance boost on downstream models.
arXiv Detail & Related papers (2024-07-09T09:55:04Z) - Quantifying Geospatial in the Common Crawl Corpus [0.07499722271664144]
This paper investigates the prevalence of geospatial data in Common Crawl releases using Gemini, a powerful language model.
We estimate that between 1 in 5 and 1 in 6 documents contain geospatial information such as coordinates and street addresses.
arXiv Detail & Related papers (2024-06-07T14:16:37Z) - Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [104.16648246740543]
We propose an efficient data collection method based on large language models.
The method bootstraps seed information through a large language model and retrieves related data from public corpora.
It not only collects knowledge-related data for specific domains but also unearths data with potential reasoning procedures.
arXiv Detail & Related papers (2024-01-26T03:38:23Z) - trajdata: A Unified Interface to Multiple Human Trajectory Datasets [32.93180256927027]
We present trajdata, a unified interface to multiple human trajectory datasets.
Trajdata provides a simple, uniform, and efficient representation and API for trajectory and map data.
arXiv Detail & Related papers (2023-07-26T02:45:59Z) - GeoDE: a Geographically Diverse Evaluation Dataset for Object Recognition [31.194474203667042]
GeoDE is a geographically diverse dataset with 61,940 images from 40 classes and 6 world regions.
We release the full dataset and code at https://geodiverse-data-collection.cs.princeton.edu/.
arXiv Detail & Related papers (2023-01-05T18:21:50Z) - AutoGeoLabel: Automated Label Generation for Geospatial Machine Learning [69.47585818994959]
We evaluate a big data processing pipeline to auto-generate labels for remote sensing data.
We utilize the big geo-data platform IBM PAIRS to dynamically generate such labels in dense urban areas.
arXiv Detail & Related papers (2022-01-31T20:02:22Z) - Datasets: A Community Library for Natural Language Processing [55.48866401721244]
datasets is a community library for contemporary NLP.
The library includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects.
arXiv Detail & Related papers (2021-09-07T03:59:22Z) - Documenting the English Colossal Clean Crawled Corpus [28.008953329187648]
This work provides the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl.
We begin with a high-level summary of the data, including distributions of where the text came from and when it was written.
We then give more detailed analysis on salient parts of this data, including the most frequent sources of text.
arXiv Detail & Related papers (2021-04-18T07:42:52Z) - Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets.
We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy.
Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z) - A Large Dataset of Historical Japanese Documents with Complex Layouts [5.343406649012619]
HJDataset is a large dataset of historical Japanese documents with complex layouts.
It contains over 250,000 layout element annotations of seven types.
A semi-rule based method is developed to extract the layout elements, and the results are checked by human inspectors.
arXiv Detail & Related papers (2020-04-18T18:38:25Z) - Neural Data Server: A Large-Scale Search Engine for Transfer Learning Data [78.74367441804183]
We introduce Neural Data Server (NDS), a large-scale search engine for finding the most useful transfer learning data to the target domain.
NDS consists of a dataserver which indexes several large popular image datasets, and aims to recommend data to a client.
We show the effectiveness of NDS in various transfer learning scenarios, demonstrating state-of-the-art performance on several target datasets.
arXiv Detail & Related papers (2020-01-09T01:21:30Z)