Related papers: DMOps: Data Management Operation and Recipes

Related papers

Data Science and Technology Towards AGI Part I: Tiered Data Management [53.64581824953229]
We argue that the development of artificial intelligence is entering a new phase of data-model co-evolution. We introduce an L0-L4 tiered data management framework, ranging from raw uncurated resources to organized and verifiable knowledge. We validate the effectiveness of the proposed framework through empirical studies.
arXiv Detail & Related papers (2026-02-09T18:47:51Z)
Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs [66.63911043019294]
Data preparation aims to denoise raw datasets, uncover cross-dataset relationships, and extract valuable insights from them. This paper focuses on the use of LLM techniques to prepare data for diverse downstream tasks. We introduce a task-centric taxonomy that organizes the field into three major tasks: data cleaning (standardization, error processing, imputation), data integration, and data enrichment.
arXiv Detail & Related papers (2026-01-22T12:02:45Z)
A Survey on Efficient Large Language Model Training: From Data-centric Perspectives [42.897899343082806]
We present the first systematic survey of data-efficient Large Language Model post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods, covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We hope our work inspires further exploration into maximizing the potential of data utilization in large-scale model training.
arXiv Detail & Related papers (2025-10-29T17:01:55Z)
More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning [47.13636836547429]
We conduct a comprehensive analysis of open-source datasets and data synthesis techniques for mathematical reasoning. Our findings highlight that structuring data in more interpretable formats or distilling from stronger models often outweighs simply scaling up data volume.
arXiv Detail & Related papers (2025-10-08T16:07:26Z)
The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track [1.5993707490601146]
This work provides an analysis of dataset development practices at NeurIPS through the lens of data curation. We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit. Results indicate greater need for documentation about environmental footprint, ethical considerations, and data management.
arXiv Detail & Related papers (2024-10-29T19:07:50Z)
Data Proportion Detection for Optimized Data Management for Large Language Models [32.62631669919273]
We introduce a new topic, data proportion detection, which enables the automatic estimation of pre-training data proportions. We provide rigorous theoretical proofs, practical algorithms, and preliminary experimental results for data proportion detection.
arXiv Detail & Related papers (2024-09-26T04:30:32Z)
Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning [3.623224034411137]
Offline multi-agent reinforcement learning (MARL) is an exciting direction of research that uses static datasets to find optimal control policies for multi-agent systems. Though the field is by definition data-driven, efforts have thus far neglected data in their drive to achieve state-of-the-art results. We show how the majority of works generate their own datasets without consistent methodology and provide sparse information about the characteristics of these datasets.
arXiv Detail & Related papers (2024-09-18T14:13:24Z)
Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential (e.g., in-context learning) stages of LLMs. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
Data Management For Training Large Language Models: A Survey [64.18200694790787]
Data plays a fundamental role in training Large Language Models (LLMs). This survey aims to provide a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs.
arXiv Detail & Related papers (2023-12-04T07:42:16Z)
Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets. We then introduce the DAM challenge, a benchmark to model the interaction between data providers and acquirers. Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in machine learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z)
Optimizing the AI Development Process by Providing the Best Support Environment [0.756282840161499]
The main stages of machine learning are problem understanding, data management, model building, model deployment, and maintenance. The framework was built in Python to perform data augmentation using deep learning advancements.
arXiv Detail & Related papers (2023-04-29T00:44:50Z)
DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We provide an open, online platform with multiple rounds of challenges to support this iterative development. The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
Deep Reinforcement Learning Assisted Federated Learning Algorithm for Data Management of IIoT [82.33080550378068]
The continuously expanding scale of the industrial Internet of Things (IIoT) leads to IIoT equipment generating massive amounts of user data every moment. How to manage these time series data efficiently and safely in the field of IIoT is still an open issue. This paper studies applications of federated learning (FL) technology to manage IIoT equipment data in wireless network environments.
arXiv Detail & Related papers (2022-02-03T07:12:36Z)
An Empirical Survey of Data Augmentation for Limited Data Learning in NLP [88.65488361532158]
Dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks. Data augmentation methods have been explored as a means of improving data efficiency in NLP. We provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting.
arXiv Detail & Related papers (2021-06-14T15:27:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.