Datasets for Large Language Models: A Comprehensive Survey
- URL: http://arxiv.org/abs/2402.18041v1
- Date: Wed, 28 Feb 2024 04:35:51 GMT
- Title: Datasets for Large Language Models: A Comprehensive Survey
- Authors: Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, Lianwen Jin
- Abstract summary: The survey consolidates and categorizes the fundamental aspects of LLM datasets from five perspectives.
The survey sheds light on the prevailing challenges and points out potential avenues for future investigation.
The total data size surveyed surpasses 774.5 TB for pre-training corpora and 700M instances for other datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper explores Large Language Model (LLM) datasets, which
play a crucial role in the remarkable advancements of LLMs. The
datasets serve as the foundational infrastructure analogous to a root system
that sustains and nurtures the development of LLMs. Consequently, examination
of these datasets emerges as a critical topic in research. In order to address
the current lack of a comprehensive overview and thorough analysis of LLM
datasets, and to gain insights into their current status and future trends,
this survey consolidates and categorizes the fundamental aspects of LLM
datasets from five perspectives: (1) Pre-training Corpora; (2) Instruction
Fine-tuning Datasets; (3) Preference Datasets; (4) Evaluation Datasets; (5)
Traditional Natural Language Processing (NLP) Datasets. The survey sheds light
on the prevailing challenges and points out potential avenues for future
investigation. Additionally, a comprehensive review of the available dataset
resources is provided, including statistics from 444 datasets,
covering 8 language categories and spanning 32 domains. Information from 20
dimensions is incorporated into the dataset statistics. The total data size
surveyed surpasses 774.5 TB for pre-training corpora and 700M instances for
other datasets. We aim to present the entire landscape of LLM text datasets,
serving as a comprehensive reference for researchers in this field and
contributing to future studies. Related resources are available at:
https://github.com/lmmlzn/Awesome-LLMs-Datasets.
Related papers
- The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track [1.5993707490601146]
This work provides an analysis of dataset development practices at NeurIPS through the lens of data curation.
We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit.
Results indicate greater need for documentation about environmental footprint, ethical considerations, and data management.
arXiv Detail & Related papers (2024-10-29T19:07:50Z)
- Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning [3.623224034411137]
Offline multi-agent reinforcement learning (MARL) is an exciting direction of research that uses static datasets to find optimal control policies for multi-agent systems.
Though the field is by definition data-driven, efforts have thus far neglected data in their drive to achieve state-of-the-art results.
We show how the majority of works generate their own datasets without consistent methodology and provide sparse information about the characteristics of these datasets.
arXiv Detail & Related papers (2024-09-18T14:13:24Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- Large Language Models for Data Annotation: A Survey [49.8318827245266]
The emergence of advanced Large Language Models (LLMs) presents an unprecedented opportunity to automate the complicated process of data annotation.
This survey includes an in-depth taxonomy of data types that LLMs can annotate, a review of learning strategies for models utilizing LLM-generated annotations, and a detailed discussion of the primary challenges and limitations associated with using LLMs for data annotation.
arXiv Detail & Related papers (2024-02-21T00:44:04Z)
- Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face [46.60562029098208]
We analyze all 7,433 dataset documentation pages on Hugging Face.
Our study offers a unique perspective on analyzing dataset documentation through large-scale data science analysis.
arXiv Detail & Related papers (2024-01-24T21:47:13Z)
- TeleQnA: A Benchmark Dataset to Assess Large Language Models Telecommunications Knowledge [26.302396162473293]
TeleQnA is the first benchmark dataset designed to evaluate the knowledge of Large Language Models (LLMs) in telecommunications.
This paper outlines the automated question generation framework responsible for creating this dataset, along with how human input was integrated at various stages to ensure the quality of the questions.
The dataset has been made publicly accessible on GitHub.
arXiv Detail & Related papers (2023-10-23T15:55:15Z) - LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset [75.9621305227523]
We introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art large language models (LLMs)
This dataset is collected from 210K IP addresses in the wild on our Vicuna demo and Arena website.
We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions.
arXiv Detail & Related papers (2023-09-21T12:13:55Z) - Large Language Models as Data Preprocessors [9.99065004972981]
Large Language Models (LLMs) have marked a significant advancement in artificial intelligence.
This study explores their potential in data preprocessing, a critical stage in data mining and analytics applications.
We propose an LLM-based framework for data preprocessing, which integrates cutting-edge prompt engineering techniques.
arXiv Detail & Related papers (2023-08-30T23:28:43Z) - LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting [65.71129509623587]
Road traffic forecasting plays a critical role in smart city initiatives and has experienced significant advancements thanks to the power of deep learning.
However, the promising results achieved on current public datasets may not be applicable to practical scenarios.
We introduce the LargeST benchmark dataset, which includes a total of 8,600 sensors in California with a 5-year time coverage.
arXiv Detail & Related papers (2023-06-14T05:48:36Z)