A Survey on Data Selection for LLM Instruction Tuning
- URL: http://arxiv.org/abs/2402.05123v1
- Date: Sun, 4 Feb 2024 13:32:01 GMT
- Title: A Survey on Data Selection for LLM Instruction Tuning
- Authors: Jiahao Wang, Bolin Zhang, Qianlong Du, Jiajun Zhang, Dianhui Chu
- Abstract summary: We propose a new taxonomy of data selection methods and provide a detailed introduction to recent advances.
We emphasize the open challenges and present new frontiers of this task.
- Score: 18.94987580516951
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instruction tuning is a vital step in training large language models (LLMs),
so how to improve its effectiveness has received increasing attention. Existing work
indicates that the quality of the dataset matters more than its quantity during
instruction tuning of LLMs. Consequently, many recent studies explore methods for
selecting high-quality subsets from instruction datasets, aiming to reduce training
costs and enhance the instruction-following capabilities of LLMs. This paper presents a
comprehensive survey on data selection for LLM instruction tuning. First, we introduce
the widely used instruction datasets. Then, we propose a new taxonomy of data selection
methods and provide a detailed introduction to recent advances; the evaluation
strategies and results of data selection methods are also elaborated in detail.
Finally, we emphasize the open challenges and present new frontiers of this task.
Related papers
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs [59.76268575344119]
We introduce a novel framework for enhancing large language models' (LLMs) planning capabilities by using planning data derived from knowledge graphs (KGs).
LLMs fine-tuned with KG data have improved planning capabilities, better equipping them to handle complex QA tasks that involve retrieval.
arXiv Detail & Related papers (2024-06-20T13:07:38Z)
- Don't Half-listen: Capturing Key-part Information in Continual Instruction Tuning [13.535110749767451]
We propose a novel continual instruction tuning method based on Key-part Information Gain (KPIG).
Our method computes the information gain on masked parts to dynamically replay data and refine the training objective.
Experiments demonstrate our method achieves superior performance on both seen and held-out tasks.
arXiv Detail & Related papers (2024-03-15T06:54:20Z)
- Large Language Models for Data Annotation: A Survey [49.8318827245266]
The emergence of advanced Large Language Models (LLMs) presents an unprecedented opportunity to automate the complicated process of data annotation.
This survey includes an in-depth taxonomy of data types that LLMs can annotate, a review of learning strategies for models utilizing LLM-generated annotations, and a detailed discussion of the primary challenges and limitations associated with using LLMs for data annotation.
arXiv Detail & Related papers (2024-02-21T00:44:04Z)
- Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection [59.11430077029321]
We introduce a novel dataset selection method, Self-Filter, for vision-language models (VLMs).
In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM.
In the second stage, we use the trained score net to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity.
arXiv Detail & Related papers (2024-02-19T20:08:48Z)
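To make the second stage described above concrete, here is a minimal sketch of difficulty-first, diversity-aware selection. It assumes difficulty scores and instruction embeddings have already been produced (for example by a co-trained scoring network); the function name and the similarity penalty are illustrative, not the paper's exact formulation.

```python
import numpy as np

def select_hard_but_diverse(difficulty, embeddings, k, penalty=0.5):
    """Greedy sketch: repeatedly pick the most difficult remaining sample,
    then down-weight samples similar to it to encourage diversity."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    adjusted = np.asarray(difficulty, dtype=float).copy()
    selected = []
    for _ in range(k):
        idx = int(np.argmax(adjusted))
        selected.append(idx)
        adjusted[idx] = -np.inf                    # never pick the same sample twice
        similarity = emb @ emb[idx]                # cosine similarity to the new pick
        adjusted -= penalty * np.clip(similarity, 0.0, None)  # penalize near-duplicates
    return selected
```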
- MoDS: Model-oriented Data Selection for Instruction Tuning [35.60124047070829]
We present a model-oriented data selection (MoDS) approach, which selects instruction data based on new criteria covering three aspects: quality, coverage, and necessity.
Experimental results show that the model fine-tuned with 4,000 instruction pairs selected by our approach performs better than the model fine-tuned with the full original dataset.
arXiv Detail & Related papers (2023-11-27T09:33:13Z)
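A quality-coverage-necessity pipeline of this kind could be sketched roughly as below; `quality_score`, `k_center_greedy`, and `necessity_score` are hypothetical callables standing in for reward-model scoring, diverse seed selection, and necessity evaluation, and the thresholds are arbitrary placeholders rather than values from the paper.

```python
def three_criteria_select(dataset, quality_score, k_center_greedy, necessity_score,
                          quality_threshold=0.8, seed_size=1000, necessity_threshold=0.5):
    """Sketch of a three-stage selection pipeline: filter for quality,
    pick a diverse seed set for coverage, then add samples a seed-trained
    model still handles poorly (necessity)."""
    # 1) Quality: keep instruction pairs whose quality score is high enough.
    high_quality = [x for x in dataset if quality_score(x) >= quality_threshold]
    # 2) Coverage: choose a small, diverse seed subset of the high-quality pool.
    seed = k_center_greedy(high_quality, seed_size)
    # 3) Necessity: add high-quality samples the seed-tuned model answers poorly.
    extra = [x for x in high_quality
             if x not in seed and necessity_score(x) >= necessity_threshold]
    return seed + extra
```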
- Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning [79.32236399694077]
Low-quality data in the training set are usually detrimental to instruction tuning.
We propose a novel method, termed "reflection-tuning".
This approach utilizes an oracle LLM to recycle the original training data by introspecting and enhancing the quality of instructions and responses in the data.
arXiv Detail & Related papers (2023-10-18T05:13:47Z)
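A heavily simplified sketch of such a recycling step might look like the following, where `oracle_generate` is a placeholder for any call to a strong LLM; the prompts are ours and not those used in the paper.

```python
def recycle_pair(instruction, response, oracle_generate):
    """Sketch: ask an oracle LLM to reflect on a training pair and rewrite it.
    `oracle_generate` is any callable mapping a prompt string to generated text."""
    improved_instruction = oracle_generate(
        "Critique the following instruction, then rewrite it so it is clearer, "
        f"more specific, and more challenging:\n\n{instruction}"
    )
    improved_response = oracle_generate(
        "Write a detailed, high-quality response to the instruction below.\n\n"
        f"Instruction: {improved_instruction}\n\n"
        f"Draft response to improve on:\n{response}"
    )
    return improved_instruction, improved_response
```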
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability.
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
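As a rough illustration of the IFD idea, the score can be computed as the ratio between the answer's loss conditioned on the instruction and its loss without the instruction; higher values suggest the instruction provides little help, marking a harder sample. The sketch below uses Hugging Face Transformers and ignores chat templates and other formatting details from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def answer_loss(model, tokenizer, prompt, answer):
    """Mean cross-entropy over the answer tokens, optionally conditioned on a prompt."""
    answer_ids = tokenizer(answer, return_tensors="pt", add_special_tokens=False).input_ids
    if prompt:
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100   # score only the answer tokens
    else:
        input_ids = answer_ids
        labels = input_ids.clone()
    with torch.no_grad():
        return model(input_ids=input_ids, labels=labels).loss.item()

def ifd_score(model, tokenizer, instruction, answer):
    """IFD ~ conditioned answer loss / unconditioned answer loss."""
    return (answer_loss(model, tokenizer, instruction, answer)
            / answer_loss(model, tokenizer, "", answer))

# Example usage (model name is illustrative):
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# score = ifd_score(model, tokenizer, "Summarize the plot of Hamlet.", "Hamlet is ...")
```

Samples would then be ranked by this score and only the highest-scoring fraction kept for fine-tuning.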
- Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning [13.558918552284906]
This paper focuses on reducing the data used in instruction tuning for large language models (LLMs) to decrease training costs and improve data efficiency.
The results suggest that task-specific models can be trained using less than 0.5% of the original dataset, with a 2% improvement in performance over those trained on full task-related data.
arXiv Detail & Related papers (2023-05-16T07:52:57Z)