Related papers: Data Management For Large Language Models: A Survey

URL: http://arxiv.org/abs/2312.01700v2
Date: Tue, 26 Dec 2023 01:35:38 GMT
Title: Data Management For Large Language Models: A Survey
Authors: Zige Wang, Wanjun Zhong, Yufei Wang, Qi Zhu, Fei Mi, Baojun Wang, Lifeng Shang, Xin Jiang, Qun Liu
Abstract summary: Data plays a fundamental role in the training of Large Language Models (LLMs). This survey provides a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs.
Score: 66.59562797566163
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Data plays a fundamental role in the training of Large Language Models (LLMs). Effective data management, particularly in the formulation of a well-suited training dataset, holds significance for enhancing model performance and improving training efficiency during pretraining and supervised fine-tuning phases. Despite the considerable importance of data management, the current research community still falls short in providing a systematic analysis of the rationale behind management strategy selection, its consequential effects, methodologies for evaluating curated datasets, and the ongoing pursuit of improved strategies. Consequently, the exploration of data management has attracted more and more attention among the research community. This survey provides a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs, covering various noteworthy aspects of data management strategy design: data quantity, data quality, domain/task composition, etc. Looking toward the future, we extrapolate existing challenges and outline promising directions for development in this field. Therefore, this survey serves as a guiding resource for practitioners aspiring to construct powerful LLMs through effective data management practices. The collection of the latest papers is available at https://github.com/ZigeW/data_management_LLM.

Related papers

Data Efficacy for Language Model Training [29.901090317084005]
Data is fundamental to the training of language models (LM). Recent research has been dedicated to data efficiency, which aims to maximize performance by selecting a minimal or optimal subset of training data. This work introduces a general paradigm, DELT, for considering data efficacy in LM training.
arXiv Detail & Related papers (2025-06-26T17:59:07Z)
LLM Post-Training: A Deep Dive into Reasoning Large Language Models [131.10969986056]
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations.
arXiv Detail & Related papers (2025-02-28T18:59:54Z)
A Practical Guide to Fine-tuning Language Models with Limited Data [9.413178499853156]
Employing pre-trained Large Language Models (LLMs) has become the de facto standard in Natural Language Processing (NLP) despite their extensive data requirements. Motivated by the recent surge in research focused on training LLMs with limited data, this paper surveys recent transfer learning approaches to optimize model performance in downstream tasks where data is scarce.
arXiv Detail & Related papers (2024-11-14T15:55:37Z)
Data Proportion Detection for Optimized Data Management for Large Language Models [32.62631669919273]
We introduce a new topic, data proportion detection, which enables the automatic estimation of pre-training data proportions. We provide rigorous theoretical proofs, practical algorithms, and preliminary experimental results for data proportion detection.
arXiv Detail & Related papers (2024-09-26T04:30:32Z)
Data Augmentation using Large Language Models: Data Perspectives, Learning Paradigms and Challenges [47.45993726498343]
Data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without the need for additional data collection. This survey explores the transformative impact of large language models (LLMs) on DA, particularly addressing the unique challenges and opportunities they present in the context of natural language processing (NLP) and beyond.
arXiv Detail & Related papers (2024-03-05T14:11:54Z)
A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset. Deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive. Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training language models (LLMs). In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
CUDC: A Curiosity-Driven Unsupervised Data Collection Method with Adaptive Temporal Distances for Offline Reinforcement Learning [62.58375643251612]
We propose a Curiosity-driven Unsupervised Data Collection (CUDC) method to expand feature space using adaptive temporal distances for task-agnostic data collection. With this adaptive reachability mechanism in place, the feature representation can be diversified, and the agent can navigate itself to collect higher-quality data with curiosity. Empirically, CUDC surpasses existing unsupervised methods in efficiency and learning performance in various downstream offline RL tasks of the DeepMind control suite.
arXiv Detail & Related papers (2023-12-19T14:26:23Z)
ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP). ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective. We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z)
Optimizing the AI Development Process by Providing the Best Support Environment [0.756282840161499]
The main stages of machine learning are problem understanding, data management, model building, model deployment, and maintenance. The framework was built in Python to perform data augmentation using deep learning advancements.
arXiv Detail & Related papers (2023-04-29T00:44:50Z)
DMOps: Data Management Operation and Recipes [2.28438857884398]
Data-centric AI has shed light on the significance of data within the machine learning (ML) pipeline. We propose "Data Management Operations and Recipes" to guide the industry in optimizing the building of datasets for NLP products.
arXiv Detail & Related papers (2023-01-02T09:46:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.