DMOps: Data Management Operation and Recipes
- URL: http://arxiv.org/abs/2301.01228v3
- Date: Mon, 26 Jun 2023 01:23:05 GMT
- Title: DMOps: Data Management Operation and Recipes
- Authors: Eujeong Choi, Chanjun Park
- Abstract summary: Data-centric AI has shed light on the significance of data within the machine learning (ML) pipeline.
We propose a "Data Management Operations and Recipes" to guide the industry in optimizing the building of datasets for NLP products.
- Score: 2.28438857884398
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data-centric AI has shed light on the significance of data within the machine
learning (ML) pipeline. Recognizing its significance, academia, industry, and
government departments have suggested various NLP data research initiatives.
While the ability to utilize existing data is essential, the ability to build a
dataset has become more critical than ever, especially in the industry. In
consideration of this trend, we propose a "Data Management Operations and
Recipes" to guide the industry in optimizing the building of datasets for NLP
products. This paper presents the concept of DMOps which is derived from
real-world experiences with NLP data management and aims to streamline data
operations by offering a baseline.
Related papers
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs)
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - Data Management For Large Language Models: A Survey [66.59562797566163]
Data plays a fundamental role in the training of Large Language Models (LLMs)
This survey provides a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs.
arXiv Detail & Related papers (2023-12-04T07:42:16Z) - Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing lack of platforms offering detailed information about datasets.
We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers.
Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z) - Optimizing the AI Development Process by Providing the Best Support
Environment [0.756282840161499]
Main stages of machine learning are problem understanding, data management, model building, model deployment and maintenance.
The framework was built using python language to perform data augmentation using deep learning advancements.
arXiv Detail & Related papers (2023-04-29T00:44:50Z) - DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z) - Deep Reinforcement Learning Assisted Federated Learning Algorithm for
Data Management of IIoT [82.33080550378068]
The continuous expanded scale of the industrial Internet of Things (IIoT) leads to IIoT equipments generating massive amounts of user data every moment.
How to manage these time series data in an efficient and safe way in the field of IIoT is still an open issue.
This paper studies the FL technology applications to manage IIoT equipment data in wireless network environments.
arXiv Detail & Related papers (2022-02-03T07:12:36Z) - Big Machinery Data Preprocessing Methodology for Data-Driven Models in
Prognostics and Health Management [0.0]
This paper presents a comprehensive, step-by-step pipeline for the preprocessing of monitoring data from complex systems.
The importance of expert knowledge is discussed in the context of data selection and label generation.
Two case studies are presented for validation, with the end goal of creating clean data sets with healthy and unhealthy labels.
arXiv Detail & Related papers (2021-10-08T17:10:12Z) - On The State of Data In Computer Vision: Human Annotations Remain
Indispensable for Developing Deep Learning Models [0.0]
High-quality labeled datasets play a crucial role in fueling the development of machine learning (ML)
Since the emergence of the ImageNet dataset and the AlexNet model in 2012, the size of new open-source labeled vision datasets has remained roughly constant.
Only a minority of publications in the computer vision community tackle supervised learning on datasets that are orders of magnitude larger than Imagenet.
arXiv Detail & Related papers (2021-07-31T00:08:21Z) - An Empirical Survey of Data Augmentation for Limited Data Learning in
NLP [88.65488361532158]
dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks.
Data augmentation methods have been explored as a means of improving data efficiency in NLP.
We provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting.
arXiv Detail & Related papers (2021-06-14T15:27:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.