Instruction Mining: When Data Mining Meets Large Language Model Finetuning
- URL: http://arxiv.org/abs/2307.06290v2
- Date: Fri, 27 Oct 2023 20:53:22 GMT
- Title: Instruction Mining: When Data Mining Meets Large Language Model Finetuning
- Authors: Yihan Cao, Yanbin Kang, Chi Wang, Lichao Sun
- Abstract summary: InstructMining is designed for automatically selecting premium instruction-following data for finetuning large language models.
We show that InstructMining achieves state-of-the-art performance on two of the most popular benchmarks: LLM-as-a-judge and Huggingface OpenLLM leaderboard.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are initially pretrained for broad capabilities
and then finetuned with instruction-following datasets to improve their
performance in interacting with humans. Despite advances in finetuning, a
standardized guideline for selecting high-quality datasets to optimize this
process remains elusive. In this paper, we first propose InstructMining, an
innovative method designed for automatically selecting premium
instruction-following data for finetuning LLMs. Specifically, InstructMining
utilizes natural language indicators as a measure of data quality, applying
them to evaluate unseen datasets. During experimentation, we discover that a
double descent phenomenon exists in large language model finetuning. Based on
this observation, we further leverage BlendSearch to help find the best subset
among the entire dataset (i.e., 2,532 out of 100,000). Experiment results show
that InstructMining-7B achieves state-of-the-art performance on two of the most
popular benchmarks: LLM-as-a-judge and Huggingface OpenLLM leaderboard.
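The selection procedure described in the abstract has two steps: score each instruction example with cheap quality indicators, then search over subset sizes for the one that minimizes a validation objective. The following is a hypothetical toy sketch, not the authors' code: `quality_score`, its indicator weights, and the plain loop standing in for BlendSearch's adaptive search are all illustrative assumptions.

```python
# Hypothetical sketch of InstructMining-style data selection (illustrative only).
# Step 1: score every example with simple quality indicators.
# Step 2: sweep candidate subset sizes and keep the one with lowest eval loss
#         (the paper uses BlendSearch to explore this space adaptively).

def quality_score(example):
    # Toy stand-in indicators: reward longer responses and a per-example
    # reward signal. The paper's natural-language indicators are richer.
    length = len(example["response"].split())
    reward = example.get("reward", 0.0)
    return 0.5 * min(length / 100, 1.0) + 0.5 * reward

def select_subset(dataset, candidate_sizes, eval_loss):
    # Rank all examples by quality, then pick the best-performing prefix.
    ranked = sorted(dataset, key=quality_score, reverse=True)
    best_size, best_loss = None, float("inf")
    for k in candidate_sizes:
        loss = eval_loss(ranked[:k])
        if loss < best_loss:
            best_size, best_loss = k, loss
    return ranked[:best_size], best_size

# Toy usage: a synthetic dataset and a loss that pretends size 20 is optimal,
# loosely mimicking the non-monotone (double descent) behavior the paper reports.
data = [{"response": "word " * n, "reward": (n % 7) / 7} for n in range(1, 50)]
toy_loss = lambda subset: abs(len(subset) - 20) / 20
subset, k = select_subset(data, [5, 10, 20, 40], toy_loss)
```

In the paper, the search space is the full 100,000-example pool and the selected subset is 2,532 examples; the sketch above only shows the shape of the rank-then-search loop.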
Related papers
- Aligning Large Language Models with Self-generated Preference Data [72.99676237703099]
We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences.
Our key idea is leveraging the human prior knowledge within the small (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z)
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [90.4820014819937]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when finetuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
- Improving Sentence Embeddings with an Automatically Generated NLI Dataset [15.235687410343171]
Decoder-based large language models (LLMs) have shown high performance on many tasks in natural language processing.
We aim to improve sentence embeddings learned in an unsupervised setting by automatically generating an NLI dataset.
In experiments on STS tasks, the proposed method achieved an average Spearman's rank correlation coefficient of 82.21 with respect to human evaluation.
arXiv Detail & Related papers (2024-02-23T06:33:51Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms [2.249916681499244]
We finetune open-source MPT-7B and MPT-30B models on instruction finetuning datasets of various sizes ranging from 1k to 60k samples.
We find that subsets of 1k-6k instruction finetuning samples are sufficient to achieve good performance on both (1) traditional NLP benchmarks and (2) model-based evaluation.
arXiv Detail & Related papers (2023-11-22T03:37:01Z)
- Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
We introduce a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective.
The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets.
Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
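The diversity-enhancing sampling described above can be approximated with a greedy farthest-point heuristic: repeatedly add the candidate whose embedding is farthest from everything already chosen. This is a generic sketch of diversity-oriented subset selection, not DiverseEvol's actual (model-driven, iterative) mechanism; the 2-D points and Euclidean distance are illustrative assumptions.

```python
# Greedy farthest-point subset selection: a simple way to maximize
# diversity among chosen examples. Illustrative only; DiverseEvol's
# actual sampling is performed by the model itself across training rounds.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def farthest_point_subset(embeddings, k):
    chosen = [0]  # seed with the first example
    # dist[i] = distance from example i to its nearest already-chosen example
    dist = [euclidean(embeddings[0], e) for e in embeddings]
    while len(chosen) < k:
        nxt = max(range(len(embeddings)), key=lambda i: dist[i])
        chosen.append(nxt)
        for i, e in enumerate(embeddings):
            dist[i] = min(dist[i], euclidean(e, embeddings[nxt]))
    return chosen

# Toy usage: from two tight clusters, the heuristic picks one point per cluster.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
picked = farthest_point_subset(points, 2)
```

Each added point is maximally far from the current subset, so near-duplicates are selected last, which is the intuition behind diversity-first instruction sampling.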
arXiv Detail & Related papers (2023-11-14T14:10:40Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
- GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
The Generation, Evaluation, and Metrics (GEM) Benchmark introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
arXiv Detail & Related papers (2022-06-22T17:52:30Z)
- Ranking Creative Language Characteristics in Small Data Scenarios [52.00161818003478]
We adapt the DirectRanker to provide a new deep model for ranking creative language with small data.
Our experiments with sparse training data show that while the performance of standard neural ranking approaches collapses with small datasets, DirectRanker remains effective.
arXiv Detail & Related papers (2020-10-23T18:57:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.