Data Management For Large Language Models: A Survey
- URL: http://arxiv.org/abs/2312.01700v2
- Date: Tue, 26 Dec 2023 01:35:38 GMT
- Title: Data Management For Large Language Models: A Survey
- Authors: Zige Wang, Wanjun Zhong, Yufei Wang, Qi Zhu, Fei Mi, Baojun Wang,
Lifeng Shang, Xin Jiang, Qun Liu
- Abstract summary: Data plays a fundamental role in the training of Large Language Models (LLMs)
This survey provides a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs.
- Score: 66.59562797566163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data plays a fundamental role in the training of Large Language Models
(LLMs). Effective data management, particularly in the formulation of a
well-suited training dataset, holds significance for enhancing model
performance and improving training efficiency during pretraining and supervised
fine-tuning phases. Despite the considerable importance of data management, the
current research community still falls short in providing a systematic analysis
of the rationale behind management strategy selection, its consequential
effects, methodologies for evaluating curated datasets, and the ongoing pursuit
of improved strategies. Consequently, the exploration of data management has
attracted more and more attention among the research community. This survey
provides a comprehensive overview of current research in data management within
both the pretraining and supervised fine-tuning stages of LLMs, covering
various noteworthy aspects of data management strategy design: data quantity,
data quality, domain/task composition, etc. Looking toward the future, we
extrapolate existing challenges and outline promising directions for
development in this field. Therefore, this survey serves as a guiding resource
for practitioners aspiring to construct powerful LLMs through effective data
management practices. The collection of the latest papers is available at
https://github.com/ZigeW/data_management_LLM.
Related papers
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs)
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - Recent Advances in Data-Driven Business Process Management [0.0]
The rapid development of cutting-edge technologies has led to a paradigm shift in data-based management and decision-making.
Data-driven business process management has become a relevant and vibrant research area.
arXiv Detail & Related papers (2024-06-03T21:05:59Z) - Data Augmentation using Large Language Models: Data Perspectives, Learning Paradigms and Challenges [47.45993726498343]
Data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without the need for additional data collection.
This survey explores the transformative impact of large language models (LLMs) on DA, particularly addressing the unique challenges and opportunities they present in the context of natural language processing (NLP) and beyond.
arXiv Detail & Related papers (2024-03-05T14:11:54Z) - A Survey on Data Selection for Language Models [151.6210632830082]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z) - Exploring Data Management Challenges and Solutions in Agile Software Development: A Literature Review and Practitioner Survey [4.45543024542181]
Managing data related to a software product and its development poses significant challenges for software projects and agile development teams.
Challenges include integrating data from diverse sources and ensuring data quality in light of continuous change and adaptation.
arXiv Detail & Related papers (2024-02-01T10:07:12Z) - CUDC: A Curiosity-Driven Unsupervised Data Collection Method with
Adaptive Temporal Distances for Offline Reinforcement Learning [62.58375643251612]
We propose a Curiosity-driven Unsupervised Data Collection (CUDC) method to expand feature space using adaptive temporal distances for task-agnostic data collection.
With this adaptive reachability mechanism in place, the feature representation can be diversified, and the agent can navigate itself to collect higher-quality data with curiosity.
Empirically, CUDC surpasses existing unsupervised methods in efficiency and learning performance in various downstream offline RL tasks of the DeepMind control suite.
arXiv Detail & Related papers (2023-12-19T14:26:23Z) - ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP)
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z) - Optimizing the AI Development Process by Providing the Best Support
Environment [0.756282840161499]
Main stages of machine learning are problem understanding, data management, model building, model deployment and maintenance.
The framework was built using python language to perform data augmentation using deep learning advancements.
arXiv Detail & Related papers (2023-04-29T00:44:50Z) - DMOps: Data Management Operation and Recipes [2.28438857884398]
Data-centric AI has shed light on the significance of data within the machine learning (ML) pipeline.
We propose a "Data Management Operations and Recipes" to guide the industry in optimizing the building of datasets for NLP products.
arXiv Detail & Related papers (2023-01-02T09:46:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.