Open Data on GitHub: Unlocking the Potential of AI
- URL: http://arxiv.org/abs/2306.06191v1
- Date: Fri, 9 Jun 2023 18:43:26 GMT
- Title: Open Data on GitHub: Unlocking the Potential of AI
- Authors: Anthony Cintron Roman, Kevin Xu, Arfon Smith, Jehu Torres Vega, Caleb Robinson, Juan M Lavista Ferres
- Abstract summary: GitHub is the world's largest platform for collaborative software development, with over 100 million users.
This study highlights the potential of open data on GitHub and demonstrates how it can accelerate AI research.
- Score: 2.3324945410076685
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: GitHub is the world's largest platform for collaborative software
development, with over 100 million users. GitHub is also used extensively for
open data collaboration, hosting more than 800 million open data files,
totaling 142 terabytes of data. This study highlights the potential of open
data on GitHub and demonstrates how it can accelerate AI research. We analyze
the existing landscape of open data on GitHub and the patterns of how users
share datasets. Our findings show that GitHub is one of the largest hosts of
open data in the world and has experienced an accelerated growth of open data
assets over the past four years. By examining the open data landscape on
GitHub, we aim to empower users and organizations to leverage existing open
datasets and improve their discoverability -- ultimately contributing to the
ongoing AI revolution to help address complex societal issues. We release the
three datasets that we have collected to support this analysis as open datasets
at https://github.com/github/open-data-on-github.
Related papers
- A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI [0.0]
Generative AI and large language model (LLM) applications are transforming how individuals find and access data and knowledge.
This white paper seeks to unpack the relationship between open data and generative AI and explore possible components of a new Fourth Wave of Open Data.
arXiv Detail & Related papers (2024-05-07T14:01:33Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [58.93352076927003]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World [71.52132776748628]
We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open world.
We create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions.
We develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding.
arXiv Detail & Related papers (2023-08-03T17:59:47Z)
- Synthcity: facilitating innovative use cases of synthetic data in different data modalities [86.52703093858631]
Synthcity is an open-source software package for innovative use cases of synthetic data in ML fairness, privacy and augmentation.
Synthcity provides the practitioners with a single access point to cutting edge research and tools in synthetic data.
arXiv Detail & Related papers (2023-01-18T14:49:54Z)
- The OCEAN mailing list data set: Network analysis spanning mailing lists and code repositories [0.0]
We combine and standardize mailing lists of the Python community, resulting in 954,287 messages from 1995 to the present.
To showcase the usefulness of these data, we focus on the CPython repository and merge the technical layer with the social layer.
We discuss how these data provide a laboratory to test theories from standard organizational science in large open source projects.
arXiv Detail & Related papers (2022-04-01T17:50:15Z)
- DataLab: A Platform for Data Analysis and Intervention [96.75253335629534]
DataLab is a unified data-oriented platform that allows users to interactively analyze the characteristics of data.
DataLab has features for dataset recommendation and global vision analysis.
So far, DataLab covers 1,715 datasets and 3,583 of their transformed versions.
arXiv Detail & Related papers (2022-02-25T18:32:19Z)
- OpenFWI: Large-Scale Multi-Structural Benchmark Datasets for Seismic Full Waveform Inversion [16.117689670474142]
Full waveform inversion (FWI) is widely used in geophysics to reconstruct high-resolution velocity maps from seismic data.
Recent success of data-driven FWI methods results in a rapidly increasing demand for open datasets to serve the geophysics community.
We present OpenFWI, a collection of large-scale multi-structural benchmark datasets.
arXiv Detail & Related papers (2021-11-04T15:03:40Z)
- Datasets: A Community Library for Natural Language Processing [55.48866401721244]
datasets is a community library for contemporary NLP.
The library includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects.
arXiv Detail & Related papers (2021-09-07T03:59:22Z)
- The penumbra of open source: projects outside of centralized platforms are longer maintained, more academic and more collaborative [0.0]
We develop a novel, extensive sample of public open source project repositories outside of centralized platforms.
Our sample projects tend to have more collaborators, are maintained for longer periods, and tend to be more focused on academic and scientific problems.
arXiv Detail & Related papers (2021-06-29T17:54:26Z)
- Data Engineering for Everyone [1.2585165426919136]
Data engineering is one of the fastest-growing fields within machine learning (ML).
ML requires more data than individual teams of data engineers can readily produce.
This article shows that open-source data sets are the rocket fuel for research and innovation at even some of the largest AI organizations.
arXiv Detail & Related papers (2021-02-23T01:24:37Z)
- Open Graph Benchmark: Datasets for Machine Learning on Graphs [86.96887552203479]
We present the Open Graph Benchmark (OGB) to facilitate scalable, robust, and reproducible graph machine learning (ML) research.
OGB datasets are large-scale, encompass multiple important graph ML tasks, and cover a diverse range of domains.
For each dataset, we provide a unified evaluation protocol using meaningful application-specific data splits and evaluation metrics.
arXiv Detail & Related papers (2020-05-02T03:09:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.