Diversity Measurement and Subset Selection for Instruction Tuning Datasets
- URL: http://arxiv.org/abs/2402.02318v1
- Date: Sun, 4 Feb 2024 02:09:43 GMT
- Title: Diversity Measurement and Subset Selection for Instruction Tuning Datasets
- Authors: Peiqi Wang, Yikang Shen, Zhen Guo, Matthew Stallone, Yoon Kim, Polina Golland, Rameswar Panda
- Abstract summary: We use determinantal point processes to capture the diversity and quality of instruction tuning datasets for subset selection.
We propose to measure dataset diversity with log determinant distance that is the distance between the dataset of interest and a maximally diverse reference dataset.
- Score: 40.930387018872786
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We aim to select data subsets for the fine-tuning of large language models to
more effectively follow instructions. Prior work has emphasized the importance
of diversity in dataset curation but relied on heuristics such as the number of
tasks. In this paper, we use determinantal point processes to capture the
diversity and quality of instruction tuning datasets for subset selection. We
propose to measure dataset diversity with log determinant distance that is the
distance between the dataset of interest and a maximally diverse reference
dataset. Our experiments demonstrate that the proposed diversity measure in the
normalized weight gradient space is correlated with downstream
instruction-following performance. Consequently, it can be used to inform when
data selection is the most helpful and to analyze dataset curation strategies.
We demonstrate the utility of our approach on various instruction tuning
datasets.
Related papers
- The Best of Both Worlds: Bridging Quality and Diversity in Data Selection with Bipartite Graph [45.51085356985464]
We introduce GraphFilter, a novel method that represents the dataset as a bipartite graph, linking sentences to their constituent n-grams.
This representation effectively captures the relationships between sentences and linguistic patterns, facilitating the selection of sentences that enhance n-gram diversity.
GraphFilter iteratively selects high-priority sentences, updates the bipartite graph by removing covered n-grams, and re-calculates priorities to reflect the evolving data landscape.
arXiv Detail & Related papers (2024-10-16T11:16:34Z)
- Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning.
We construct pseudo-skill clusters by grouping gradient-based sample vectors.
We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
- Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement [8.509688686402438]
Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities.
This work addresses the question: How can we determine the optimal subset of data for effective training?
Our method employs k-means clustering to ensure the selected subset effectively represents the full dataset.
arXiv Detail & Related papers (2024-09-17T17:25:31Z)
- Feature Selection from Differentially Private Correlations [35.187113265093615]
High-dimensional regression can leak information about individual datapoints in a dataset.
We employ a correlations-based order statistic to choose important features from a dataset and privatize them.
We find that our method significantly outperforms the established baseline for private feature selection on many datasets.
arXiv Detail & Related papers (2024-08-20T13:54:07Z)
- TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data [29.45013725650798]
It is essential to extract a subset of instruction datasets that achieves comparable performance to the full dataset.
We propose Task-Agnostic Gradient Clustered COreset Selection (TAGCOS).
Specifically, we leverage sample gradients as the data representations, perform clustering to group similar data, and apply an efficient greedy algorithm for coreset selection.
arXiv Detail & Related papers (2024-07-21T17:59:20Z)
- Multi-Teacher Multi-Objective Meta-Learning for Zero-Shot Hyperspectral Band Selection [50.30291173608449]
We propose a novel multi-objective meta-learning network (M$^3$BS) for zero-shot hyperspectral band selection.
In M$^3$BS, a generalizable graph convolution network (GCN) is constructed to generate dataset-agnostic base.
The acquired meta-knowledge can be directly transferred to unseen datasets without any retraining or fine-tuning.
arXiv Detail & Related papers (2024-06-12T07:13:31Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplify the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Data Selection for Language Models via Importance Resampling [90.9263039747723]
We formalize the problem of selecting a subset of a large raw unlabeled dataset to match a desired target distribution.
We extend the classic importance resampling approach, used in low dimensions, to LM data selection.
We instantiate the DSIR framework with hashed n-gram features for efficiency, enabling the selection of 100M documents in 4.5 hours.
arXiv Detail & Related papers (2023-02-06T23:57:56Z)
- Exploiting Diversity of Unlabeled Data for Label-Efficient Semi-Supervised Active Learning [57.436224561482966]
Active learning is a research area that addresses the issues of expensive labeling by selecting the most important samples for labeling.
We introduce a new diversity-based initial dataset selection algorithm to select the most informative set of samples for initial labeling in the active learning setting.
Also, we propose a novel active learning query strategy, which uses diversity-based sampling on consistency-based embeddings.
arXiv Detail & Related papers (2022-07-25T16:11:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.