DMOps: Data Management Operation and Recipes
- URL: http://arxiv.org/abs/2301.01228v3
- Date: Mon, 26 Jun 2023 01:23:05 GMT
- Title: DMOps: Data Management Operation and Recipes
- Authors: Eujeong Choi, Chanjun Park
- Abstract summary: Data-centric AI has shed light on the significance of data within the machine learning (ML) pipeline.
We propose a "Data Management Operations and Recipes" to guide the industry in optimizing the building of datasets for NLP products.
- Score: 2.28438857884398
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data-centric AI has shed light on the significance of data within the machine
learning (ML) pipeline. Recognizing its significance, academia, industry, and
government departments have suggested various NLP data research initiatives.
While the ability to utilize existing data is essential, the ability to build a
dataset has become more critical than ever, especially in the industry. In
consideration of this trend, we propose a "Data Management Operations and
Recipes" to guide the industry in optimizing the building of datasets for NLP
products. This paper presents the concept of DMOps which is derived from
real-world experiences with NLP data management and aims to streamline data
operations by offering a baseline.
Related papers
- The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track [1.5993707490601146]
This work provides an analysis of dataset development practices at NeurIPS through the lens of data curation.
We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit.
Results indicate a greater need for documentation about environmental footprint, ethical considerations, and data management.
arXiv Detail & Related papers (2024-10-29T19:07:50Z)
- Data Proportion Detection for Optimized Data Management for Large Language Models [32.62631669919273]
We introduce a new topic, data proportion detection, which enables the automatic estimation of pre-training data proportions.
We provide rigorous theoretical proofs, practical algorithms, and preliminary experimental results for data proportion detection.
arXiv Detail & Related papers (2024-09-26T04:30:32Z)
- Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning [3.623224034411137]
Offline multi-agent reinforcement learning (MARL) is an exciting direction of research that uses static datasets to find optimal control policies for multi-agent systems.
Though the field is by definition data-driven, efforts have thus far neglected data in their drive to achieve state-of-the-art results.
We show how the majority of works generate their own datasets without consistent methodology and provide sparse information about the characteristics of these datasets.
arXiv Detail & Related papers (2024-09-18T14:13:24Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- Data Management For Training Large Language Models: A Survey [64.18200694790787]
Data plays a fundamental role in training Large Language Models (LLMs).
This survey aims to provide a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs.
arXiv Detail & Related papers (2023-12-04T07:42:16Z)
- Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets.
We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers.
Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z)
- Optimizing the AI Development Process by Providing the Best Support Environment [0.756282840161499]
The main stages of machine learning are problem understanding, data management, model building, model deployment, and maintenance.
The framework was built in Python to perform data augmentation using deep learning advances.
arXiv Detail & Related papers (2023-04-29T00:44:50Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- Deep Reinforcement Learning Assisted Federated Learning Algorithm for Data Management of IIoT [82.33080550378068]
The continuously expanding scale of the industrial Internet of Things (IIoT) leads to IIoT equipment generating massive amounts of user data every moment.
How to manage these time series data efficiently and safely in the field of IIoT is still an open issue.
This paper studies applications of federated learning (FL) to managing IIoT equipment data in wireless network environments.
arXiv Detail & Related papers (2022-02-03T07:12:36Z)
- An Empirical Survey of Data Augmentation for Limited Data Learning in NLP [88.65488361532158]
Dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks.
Data augmentation methods have been explored as a means of improving data efficiency in NLP.
We provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting.
arXiv Detail & Related papers (2021-06-14T15:27:22Z)