Data Commons
- URL: http://arxiv.org/abs/2309.13054v1
- Date: Fri, 8 Sep 2023 00:14:09 GMT
- Title: Data Commons
- Authors: Ramanathan V. Guha, Prashanth Radhakrishnan, Bo Xu, Wei Sun, Carolyn
Au, Ajai Tirumali, Muhammad J. Amjad, Samantha Piekos, Natalie Diaz, Jennifer
Chen, Julia Wu, Prem Ramaswami, James Manyika
- Abstract summary: Data Commons is a distributed network of sites that publish data in a common schema and interoperate using the Data Commons APIs.
This paper describes the architecture of Data Commons and some of its major deployments, and highlights directions for future work.
- Score: 4.568270630281101
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Publicly available data from open sources (e.g., United States Census Bureau
(Census), World Health Organization (WHO), Intergovernmental Panel on Climate
Change (IPCC)) are vital resources for policy makers, students and researchers
across different disciplines. Combining data from different sources requires
the user to reconcile the differences in schemas, formats, assumptions, and
more. This data wrangling is time-consuming and tedious, and must be repeated
by every user of the data. Our goal with Data Commons (DC) is to help make
public data accessible and useful to those who want to understand it and use it
to address societal challenges and opportunities. We do the data
processing and make the processed data widely available via standard schemas
and Cloud APIs. Data Commons is a distributed network of sites that publish
data in a common schema and interoperate using the Data Commons APIs. Data from
different Data Commons can be joined easily. The aggregate of these Data
Commons can be viewed as a single Knowledge Graph. This Knowledge Graph can
then be queried with natural language questions, utilizing advances in Large
Language Models. This paper describes the architecture of Data Commons and some
of its major deployments, and highlights directions for future work.
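The join described in the abstract rests on every site publishing triples against shared node identifiers and a shared schema, so merging two Data Commons graphs reduces to a union of their statements. The following is a minimal illustrative sketch of that idea; the node ID, property names, and values are placeholders chosen for the example, not actual Data Commons API output.

```python
# Illustrative sketch: two hypothetical Data Commons sites publish
# (subject, predicate, object) triples in a shared schema. Because node
# IDs and properties are shared, their graphs merge by simple union.
# All IDs and figures below are made-up placeholders.

from collections import defaultdict

def merge_graphs(*graphs):
    """Union several triple lists into one knowledge graph, indexed by subject."""
    kg = defaultdict(dict)
    for graph in graphs:
        for subject, predicate, obj in graph:
            kg[subject][predicate] = obj
    return kg

# Site A publishes census-style statistics about a place node.
site_a = [
    ("geoId/06", "typeOf", "State"),
    ("geoId/06", "name", "California"),
    ("geoId/06", "Count_Person", 39_538_223),
]

# Site B publishes health statistics about the same node ID.
site_b = [
    ("geoId/06", "Percent_Person_Obesity", 26.2),
]

kg = merge_graphs(site_a, site_b)

# Because both sites use the same schema and IDs, a single lookup
# now spans both sources.
print(kg["geoId/06"]["name"])                    # California
print(kg["geoId/06"]["Percent_Person_Obesity"])  # 26.2
```

In the real system this union happens behind the Data Commons APIs rather than in user code, but the design choice is the same: reconcile schemas once, at publication time, so every consumer can treat the aggregate as one graph.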
Related papers
- Brazil Data Commons: A Platform for Unifying and Integrating Brazil's Public Data [0.3322570886790747]
Brazil Data Commons is a platform that unifies various Brazilian datasets under a common semantic framework. By adopting globally recognized and interoperable data standards, Brazil Data Commons aligns with the principles of the broader Data Commons ecosystem.
arXiv Detail & Related papers (2025-11-13T20:18:20Z)
- Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training [6.00143998001152]
We introduce Common Corpus, the largest open dataset for language model pre-training. The dataset contains a wide variety of languages, ranging from the main European languages to low-resource ones rarely present in pre-training datasets.
arXiv Detail & Related papers (2025-06-02T14:43:15Z)
- RedStone: Curating General, Code, Math, and QA Data for Large Language Models [134.49774529790693]
This study explores the untapped potential of Common Crawl as a comprehensive and flexible resource for pre-training Large Language Models.
We introduce RedStone, an innovative and scalable pipeline engineered to extract and process data from Common Crawl.
arXiv Detail & Related papers (2024-12-04T15:27:39Z)
- How to Drill Into Silos: Creating a Free-to-Use Dataset of Data Subject Access Packages [0.0]
The European Union's General Data Protection Regulation strengthened data subjects' right to access their personal data.
So far, however, subjects' possibilities for actually using controller-provided subject access request packages (SARPs) are severely limited.
This dataset is publicly provided and shall, in the future, serve as a starting point for researching and comparing novel approaches for practically viable use of SARPs.
arXiv Detail & Related papers (2024-07-05T12:39:51Z)
- Large-Scale Multipurpose Benchmark Datasets For Assessing Data-Driven Deep Learning Approaches For Water Distribution Networks [41.94295877935867]
This work provides a collection of datasets that includes several small and medium-sized publicly available Water Distribution Networks (WDNs).
In total, 1,394,400 hours of WDN data operating under normal conditions are made available to the community.
arXiv Detail & Related papers (2024-04-23T11:58:40Z)
- Learning Federated Neural Graph Databases for Answering Complex Queries from Distributed Knowledge Graphs [53.03085605769093]
We propose to learn Federated Neural Graph DataBase (FedNGDB), a pioneering systematic framework that empowers privacy-preserving reasoning over multi-source graph data. FedNGDB leverages federated learning to collaboratively learn graph representations across multiple sources, enriching relationships between entities and improving the overall quality of graph data.
arXiv Detail & Related papers (2024-02-22T14:57:44Z)
- On the development of an application for the compilation of global sea level changes [0.0]
The presented solution is a web application that addresses some of the issues faced by researchers.
The application also assists with data querying, processing and visualization by making tables, showing maps and drawing graphs.
arXiv Detail & Related papers (2024-02-04T18:45:33Z)
- Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [104.16648246740543]
We propose an efficient data collection method based on large language models.
The method bootstraps seed information through a large language model and retrieves related data from public corpora.
It not only collects knowledge-related data for specific domains but also unearths data with potential reasoning procedures.
arXiv Detail & Related papers (2024-01-26T03:38:23Z)
- Packaging code for reproducible research in the public sector [0.0]
The jtstats project consists of R and Python packages for importing, processing, and visualising large and complex datasets.
Jtstats shows how domain specific packages can enable reproducible research within the public sector and beyond.
arXiv Detail & Related papers (2023-05-25T16:07:24Z)
- A Multi-Format Transfer Learning Model for Event Argument Extraction via Variational Information Bottleneck [68.61583160269664]
Event argument extraction (EAE) aims to extract arguments with given roles from texts.
We propose a multi-format transfer learning model with variational information bottleneck.
We conduct extensive experiments on three benchmark datasets, and obtain new state-of-the-art performance on EAE.
arXiv Detail & Related papers (2022-08-27T13:52:01Z)
- DeepShovel: An Online Collaborative Platform for Data Extraction in Geoscience Literature with AI Assistance [48.55345030503826]
Geoscientists need to read a huge amount of literature to locate, extract, and aggregate relevant results and data.
DeepShovel is a publicly-available AI-assisted data extraction system to support their needs.
A follow-up user evaluation with 14 researchers suggested DeepShovel improved users' efficiency of data extraction for building scientific databases.
arXiv Detail & Related papers (2022-02-21T12:18:08Z)
- HeteroQA: Learning towards Question-and-Answering through Multiple Information Sources via Heterogeneous Graph Modeling [50.39787601462344]
Community Question Answering (CQA) is a well-defined task that can be used in many scenarios, such as E-Commerce and online user community for special interests.
Most CQA methods incorporate only articles or Wikipedia to extract knowledge and answer the user's question.
We propose a question-aware heterogeneous graph transformer to incorporate the multiple information sources (MIS) in the user community to automatically generate the answer.
arXiv Detail & Related papers (2021-12-27T10:16:43Z)
- Open Domain Question Answering over Virtual Documents: A Unified Approach for Data and Text [62.489652395307914]
We use the data-to-text method as a means for encoding structured knowledge for knowledge-intensive applications, i.e. open-domain question answering (QA).
Specifically, we propose a verbalizer-retriever-reader framework for open-domain QA over data and text where verbalized tables from Wikipedia and triples from Wikidata are used as augmented knowledge sources.
We show that our Unified Data and Text QA, UDT-QA, can effectively benefit from the expanded knowledge index, leading to large gains over text-only baselines.
arXiv Detail & Related papers (2021-10-16T00:11:21Z)
- Open Graph Benchmark: Datasets for Machine Learning on Graphs [86.96887552203479]
We present the Open Graph Benchmark (OGB) to facilitate scalable, robust, and reproducible graph machine learning (ML) research.
OGB datasets are large-scale, encompass multiple important graph ML tasks, and cover a diverse range of domains.
For each dataset, we provide a unified evaluation protocol using meaningful application-specific data splits and evaluation metrics.
arXiv Detail & Related papers (2020-05-02T03:09:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.