Data Engineering for Everyone
- URL: http://arxiv.org/abs/2102.11447v1
- Date: Tue, 23 Feb 2021 01:24:37 GMT
- Title: Data Engineering for Everyone
- Authors: Vijay Janapa Reddi, Greg Diamos, Pete Warden, Peter Mattson, David
Kanter
- Abstract summary: Data engineering is one of the fastest-growing fields within machine learning (ML).
ML requires more data than individual teams of data engineers can readily produce.
This article shows that open-source data sets are the rocket fuel for research and innovation at even some of the largest AI organizations.
- Score: 1.2585165426919136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data engineering is one of the fastest-growing fields within machine learning
(ML). As ML becomes more common, the appetite for data grows more ravenous. But
ML requires more data than individual teams of data engineers can readily
produce, which presents a severe challenge to ML deployment at scale. Much like
the software-engineering revolution, where mass adoption of open-source
software replaced the closed, in-house development model for infrastructure
code, there is a growing need to enable rapid development and open contribution
to massive machine learning data sets. This article shows that open-source data
sets are the rocket fuel for research and innovation at even some of the
largest AI organizations. Our analysis of nearly 2000 research publications
from Facebook, Google and Microsoft over the past five years shows the
widespread use and adoption of open data sets. Open data sets that are easily
accessible to the public are vital to accelerating ML innovation for everyone.
But such open resources are scarce in the wild. So, what if we are able to
accelerate data-set creation via automatic data set generation tools?
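One concrete way to read that closing question is as a call for tooling that mass-produces labeled examples programmatically. Below is a minimal, hypothetical sketch of such a generator, not anything described in the paper; the templates, label names, and output file are illustrative assumptions. It fills text templates with slot values to bootstrap a labeled dataset, the kind of shortcut that automatic data-set generation tools industrialize.

```python
# Hypothetical sketch of an "automatic data set generation" tool:
# fill text templates with slot values to mass-produce labeled examples.
# Templates, labels, and the output file name are illustrative assumptions.
import csv
import random

TEMPLATES = {
    "positive": ["The {product} works great and I would buy it again.",
                 "Really happy with the {product}; setup took minutes."],
    "negative": ["The {product} stopped working after a week.",
                 "Disappointed with the {product}; support never replied."],
}
PRODUCTS = ["camera", "router", "keyboard", "blender"]

def generate(n_per_label: int, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    rows = []
    for label, templates in TEMPLATES.items():
        for _ in range(n_per_label):
            text = rng.choice(templates).format(product=rng.choice(PRODUCTS))
            rows.append({"text": text, "label": label})
    rng.shuffle(rows)
    return rows

if __name__ == "__main__":
    rows = generate(n_per_label=100)
    with open("synthetic_reviews.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "label"])
        writer.writeheader()
        writer.writerows(rows)
    print(f"wrote {len(rows)} labeled examples")
```

Real generation tools swap the hand-written templates for generative models or programmatic labeling functions, but the pipeline shape (generate, label, serialize, share) stays the same.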
Related papers
- OpenDataLab: Empowering General Artificial Intelligence with Open Datasets [53.22840149601411]
This paper introduces OpenDataLab, a platform designed to bridge the gap between diverse data sources and the need for unified data processing.
OpenDataLab integrates a wide range of open-source AI datasets and enhances data acquisition efficiency through intelligent querying and high-speed downloading services.
We anticipate that OpenDataLab will significantly boost artificial general intelligence (AGI) research and facilitate advancements in related AI fields.
arXiv Detail & Related papers (2024-06-04T10:42:01Z)
- A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI [0.0]
Generative AI and large language model (LLM) applications are transforming how individuals find and access data and knowledge.
This white paper seeks to unpack the relationship between open data and generative AI and explore possible components of a new Fourth Wave of Open Data.
arXiv Detail & Related papers (2024-05-07T14:01:33Z)
- DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows [72.40917624485822]
We introduce DataDreamer, an open source Python library that allows researchers to implement powerful large language model (LLM) workflows.
DataDreamer also helps researchers adhere to best practices that we propose to encourage open science.
arXiv Detail & Related papers (2024-02-16T00:10:26Z)
- OmniForce: On Human-Centered, Large Model Empowered and Cloud-Edge Collaborative AutoML System [85.8338446357469]
We introduce OmniForce, a human-centered AutoML system that yields both human-assisted ML and ML-assisted human techniques.
We show how OmniForce can put an AutoML system into practice and build adaptive AI in open-environment scenarios.
arXiv Detail & Related papers (2023-03-01T13:35:22Z)
- A Survey of Machine Unlearning [56.017968863854186]
Recent regulations now require that, on request, private information about a user must be removed from computer systems.
ML models often 'remember' the old data.
Recent works on machine unlearning have not been able to completely solve the problem.
arXiv Detail & Related papers (2022-09-06T08:51:53Z)
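To make the "remembering" problem above concrete, here is a minimal sketch of the naive exact-unlearning baseline: drop the requesting user's records and retrain from scratch. It is correct but expensive, which is why the unlearning literature surveyed above looks for cheaper approximations. This is my own illustration, not the survey's method; scikit-learn, the feature layout, and the synthetic users are assumptions.

```python
# Naive exact-unlearning baseline: to "forget" a user, drop their rows
# and retrain from scratch on what remains. Correct but costly.
# scikit-learn, the feature layout, and the synthetic users are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train(X: np.ndarray, y: np.ndarray) -> LogisticRegression:
    return LogisticRegression(max_iter=1000).fit(X, y)

def unlearn_user(X, y, user_ids, user_to_forget):
    keep = user_ids != user_to_forget          # mask out the user's records
    return train(X[keep], y[keep])             # retrain without them

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8))
    y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
    user_ids = rng.integers(0, 50, size=500)   # 50 synthetic users
    before = train(X, y)
    after = unlearn_user(X, y, user_ids, user_to_forget=7)
    print("accuracy before/after forgetting user 7:",
          round(before.score(X, y), 3), round(after.score(X, y), 3))
```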
- Open Environment Machine Learning [84.90891046882213]
Conventional machine learning studies assume closed-world scenarios where important factors of the learning process hold invariant.
This article briefly introduces some advances in this line of research, focusing on techniques for handling emerging new classes, decremental/incremental features, changing data distributions, and varied learning objectives, and it discusses some theoretical issues.
arXiv Detail & Related papers (2022-06-01T11:57:56Z)
- What can Data-Centric AI Learn from Data and ML Engineering? [17.247372757533185]
Data-centric AI is a new and exciting research topic in the AI community.
Many organizations already build and maintain various "data-centric" applications.
We discuss several lessons from data and ML engineering that could be interesting to apply in data-centric AI.
arXiv Detail & Related papers (2021-12-13T06:40:05Z)
- Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective [16.480530590466472]
Data-centric AI practices are now becoming mainstream.
Many datasets in the real world are small, dirty, biased, and even poisoned.
For data quality, we study data validation and data cleaning techniques.
arXiv Detail & Related papers (2021-12-13T03:57:36Z)
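The "data validation and data cleaning" mentioned above can start as simple schema, duplicate, and range checks run before training. The sketch below is my own illustration in pandas, not the paper's tooling; the column names and valid ranges are assumptions.

```python
# Minimal validation/cleaning pass before training: enforce a schema,
# drop duplicates and unlabeled rows, and filter out-of-range values.
# Column names and valid ranges are illustrative assumptions.
import pandas as pd

SCHEMA = {"age": "int64", "income": "float64", "label": "int64"}
VALID_RANGES = {"age": (0, 120), "income": (0.0, 1e7), "label": (0, 1)}

def validate_and_clean(df: pd.DataFrame) -> pd.DataFrame:
    missing = set(SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    df = df.drop_duplicates()
    df = df.dropna(subset=["label"])            # unlabeled rows are unusable here
    df = df.astype(SCHEMA)
    for col, (lo, hi) in VALID_RANGES.items():
        bad = ~df[col].between(lo, hi)
        if bad.any():
            print(f"dropping {int(bad.sum())} rows with out-of-range {col}")
            df = df[~bad]
    return df.reset_index(drop=True)

if __name__ == "__main__":
    raw = pd.DataFrame({"age": [25, 25, -3, 40],
                        "income": [50e3, 50e3, 60e3, 70e3],
                        "label": [1, 1, 0, None]})
    print(validate_and_clean(raw))
```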
- Widening Access to Applied Machine Learning with TinyML [1.1678513163359947]
We describe our pedagogical approach to increasing access to applied machine learning (ML) through a massive open online course (MOOC) on Tiny Machine Learning (TinyML).
To this end, a collaboration between academia (Harvard University) and industry (Google) produced a four-part MOOC that provides application-oriented instruction on how to develop solutions using TinyML.
The series is openly available on the edX MOOC platform, has no prerequisites beyond basic programming, and is designed for learners from a global variety of backgrounds.
arXiv Detail & Related papers (2021-06-07T23:31:47Z)
- Synthetic Data: Opening the data floodgates to enable faster, more directed development of machine learning methods [96.92041573661407]
Many ground-breaking advancements in machine learning can be attributed to the availability of a large volume of rich data.
Many large-scale datasets are highly sensitive, such as healthcare data, and are not widely available to the machine learning community.
Generating synthetic data with privacy guarantees provides one such solution.
arXiv Detail & Related papers (2020-12-08T17:26:10Z)
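To make the synthetic-data idea above concrete, here is a toy generator that fits per-column statistics on a sensitive table, perturbs them with Laplace noise, and samples synthetic rows from the perturbed marginals. This is only a crude stand-in for the privacy-preserving generators the paper discusses: it carries no formal differential-privacy guarantee, and the columns and noise scale are assumptions.

```python
# Toy synthetic-data generator: estimate per-column mean/std on the
# sensitive data, add Laplace noise to those statistics, then sample
# synthetic rows from the perturbed marginals. Illustration only --
# no formal privacy guarantee; columns and noise scale are assumptions.
import numpy as np
import pandas as pd

def synthesize(df: pd.DataFrame, n_rows: int,
               noise_scale: float = 0.1, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    columns = {}
    for col in df.columns:
        mean = df[col].mean() + rng.laplace(scale=noise_scale)
        std = max(df[col].std() + rng.laplace(scale=noise_scale), 1e-6)
        columns[col] = rng.normal(mean, std, size=n_rows)
    return pd.DataFrame(columns)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    sensitive = pd.DataFrame({"systolic_bp": rng.normal(120, 15, 1000),
                              "heart_rate": rng.normal(70, 10, 1000)})
    synthetic = synthesize(sensitive, n_rows=500)
    print(synthetic.describe().loc[["mean", "std"]])
```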
This list is automatically generated from the titles and abstracts of the papers on this site.