Related papers: Open Data on GitHub: Unlocking the Potential of AI

URL: http://arxiv.org/abs/2306.06191v1
Date: Fri, 9 Jun 2023 18:43:26 GMT
Title: Open Data on GitHub: Unlocking the Potential of AI
Authors: Anthony Cintron Roman, Kevin Xu, Arfon Smith, Jehu Torres Vega, Caleb Robinson, Juan M Lavista Ferres
Abstract summary: GitHub is the world's largest platform for collaborative software development, with over 100 million users. This study highlights the potential of open data on GitHub and demonstrates how it can accelerate AI research.
Score: 2.3324945410076685
License: http://creativecommons.org/licenses/by/4.0/
Abstract: GitHub is the world's largest platform for collaborative software development, with over 100 million users. GitHub is also used extensively for open data collaboration, hosting more than 800 million open data files, totaling 142 terabytes of data. This study highlights the potential of open data on GitHub and demonstrates how it can accelerate AI research. We analyze the existing landscape of open data on GitHub and the patterns of how users share datasets. Our findings show that GitHub is one of the largest hosts of open data in the world and has experienced an accelerated growth of open data assets over the past four years. By examining the open data landscape on GitHub, we aim to empower users and organizations to leverage existing open datasets and improve their discoverability -- ultimately contributing to the ongoing AI revolution to help address complex societal issues. We release the three datasets that we have collected to support this analysis as open datasets at https://github.com/github/open-data-on-github.

Related papers

GitHub Proxy Server: A tool for supporting massive data collection on GitHub [0.0]
GitHub is the most popular social coding platform and is widely used by developers and organizations around the world to host their open-source projects. The platform has a web API that allows developers to collect information from the public repositories hosted on it. However, collecting massive amounts of data from GitHub can be very challenging due to existing restrictions and abuse detection mechanisms. We present a tool, called GitHub Proxy Server, which abstracts away these complexities and is independent of operating system and programming language.
arXiv Detail & Related papers (2025-05-23T19:00:32Z)
SEART Data Hub: Streamlining Large-Scale Source Code Mining and Pre-Processing [13.717170962455526]
We present the SEART Data Hub, a web application that makes it easy to build and pre-process large-scale datasets featuring code mined from public GitHub repositories. Through a simple web interface, researchers can specify a set of mining criteria as well as the specific pre-processing steps they want to perform. After submitting the request, the user will receive an email with a download link for the requested dataset within a few hours.
arXiv Detail & Related papers (2024-09-27T11:42:19Z)
OpenDataLab: Empowering General Artificial Intelligence with Open Datasets [53.22840149601411]
This paper introduces OpenDataLab, a platform designed to bridge the gap between diverse data sources and the need for unified data processing. OpenDataLab integrates a wide range of open-source AI datasets and enhances data acquisition efficiency through intelligent querying and high-speed downloading services. We anticipate that OpenDataLab will significantly boost artificial general intelligence (AGI) research and facilitate advancements in related AI fields.
arXiv Detail & Related papers (2024-06-04T10:42:01Z)
A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI [0.0]
Generative AI and large language model (LLM) applications are transforming how individuals find and access data and knowledge. This white paper seeks to unpack the relationship between open data and generative AI and explore possible components of a new Fourth Wave of Open Data.
arXiv Detail & Related papers (2024-05-07T14:01:33Z)
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World [71.52132776748628]
We present the All-Seeing (AS) project: a large-scale dataset and model for recognizing and understanding everything in the open world. We create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions. We develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding.
arXiv Detail & Related papers (2023-08-03T17:59:47Z)
Synthcity: facilitating innovative use cases of synthetic data in different data modalities [86.52703093858631]
Synthcity is an open-source software package for innovative use cases of synthetic data in ML fairness, privacy and augmentation. Synthcity provides the practitioners with a single access point to cutting edge research and tools in synthetic data.
arXiv Detail & Related papers (2023-01-18T14:49:54Z)
The OCEAN mailing list data set: Network analysis spanning mailing lists and code repositories [0.0]
We combine and standardize mailing lists of the Python community, resulting in 954,287 messages from 1995 to the present. To showcase the usefulness of these data, we focus on the CPython repository and merge the technical layer with the social layer. We discuss how these data provide a laboratory to test theories from standard organizational science in large open source projects.
arXiv Detail & Related papers (2022-04-01T17:50:15Z)
DataLab: A Platform for Data Analysis and Intervention [96.75253335629534]
DataLab is a unified data-oriented platform that allows users to interactively analyze the characteristics of data. DataLab has features for dataset recommendation and global vision analysis. So far, DataLab covers 1,715 datasets and 3,583 transformed versions of them.
arXiv Detail & Related papers (2022-02-25T18:32:19Z)
OpenFWI: Large-Scale Multi-Structural Benchmark Datasets for Seismic Full Waveform Inversion [16.117689670474142]
Full waveform inversion (FWI) is widely used in geophysics to reconstruct high-resolution velocity maps from seismic data. The recent success of data-driven FWI methods has created a rapidly increasing demand for open datasets to serve the geophysics community. We present OpenFWI, a collection of large-scale multi-structural benchmark datasets.
arXiv Detail & Related papers (2021-11-04T15:03:40Z)
Datasets: A Community Library for Natural Language Processing [55.48866401721244]
datasets is a community library for contemporary NLP. The library includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects.
arXiv Detail & Related papers (2021-09-07T03:59:22Z)
The penumbra of open source: projects outside of centralized platforms are longer maintained, more academic and more collaborative [0.0]
We develop a novel, extensive sample of public open source project repositories outside of centralized platforms. Our sample projects tend to have more collaborators, are maintained for longer periods, and tend to be more focused on academic and scientific problems.
arXiv Detail & Related papers (2021-06-29T17:54:26Z)
Open Graph Benchmark: Datasets for Machine Learning on Graphs [86.96887552203479]
We present the Open Graph Benchmark (OGB) to facilitate scalable, robust, and reproducible graph machine learning (ML) research. OGB datasets are large-scale, encompass multiple important graph ML tasks, and cover a diverse range of domains. For each dataset, we provide a unified evaluation protocol using meaningful application-specific data splits and evaluation metrics.
arXiv Detail & Related papers (2020-05-02T03:09:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.