RECOST: External Knowledge Guided Data-efficient Instruction Tuning
- URL: http://arxiv.org/abs/2402.17355v1
- Date: Tue, 27 Feb 2024 09:47:36 GMT
- Title: RECOST: External Knowledge Guided Data-efficient Instruction Tuning
- Authors: Qi Zhang, Yiming Zhang, Haobo Wang, Junbo Zhao
- Abstract summary: We argue that most current data-efficient instruction-tuning methods are highly dependent on the quality of the original instruction-tuning dataset.
We propose a framework dubbed RECOST, which integrates external-knowledge-base re-ranking and diversity-consistent sampling into a single pipeline.
- Score: 25.985023475991625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the current landscape of large language models (LLMs), the process of
instruction tuning serves as an essential step. Considering the high computing
power overhead, data-efficient instruction tuning was proposed to reduce the
training data size in this process, aiming at selecting high-quality
instructional data. Nevertheless, we argue that most current data-efficient
instruction-tuning methods are highly dependent on the quality of the original
instruction-tuning dataset. When it comes to datasets synthesized by LLMs, a
common scenario in this field, dirty samples may even be selected with higher
probability than clean ones. To address these challenges, we utilize external
knowledge (relevant examples or paragraphs) to evaluate LLM-synthesized samples
with an in-context-based relative predictive entropy. Based on the new metric,
we propose a framework dubbed RECOST, which integrates external-knowledge-base
re-ranking and diversity-consistent sampling into a single pipeline. Through
extensive experiments on several synthetic datasets (Alpaca and Alpaca-gpt4),
we demonstrate the effectiveness of our method and achieve even better results
with only 1% of the full
dataset.
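The abstract does not spell out how the in-context-based relative predictive entropy is computed, so the following is a minimal sketch of one plausible reading: score each LLM-synthesized sample by how much a scorer model's uncertainty over the response drops once retrieved external knowledge is prepended, then re-rank and keep a small budget. The model choice ("gpt2"), prompt templates, normalization, and the provenance of the `knowledge` field are illustrative assumptions, not the paper's implementation; the diversity-consistent sampling step is omitted here.

```python
# Minimal sketch, assuming a HuggingFace causal LM ("gpt2") as a stand-in scorer and
# a simple interpretation of "relative predictive entropy" as the normalized drop in
# average response NLL once external knowledge is prepended to the prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def response_nll(prompt: str, response: str) -> float:
    """Average negative log-likelihood of the response tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    # Mask prompt positions so only response tokens contribute to the loss.
    # (The prompt/response token boundary is an approximation in this sketch.)
    labels[:, : prompt_ids.shape[1]] = -100
    with torch.no_grad():
        out = model(full_ids, labels=labels)
    return out.loss.item()


def relative_predictive_entropy(instruction: str, response: str, knowledge: str) -> float:
    """Hypothetical scoring rule: how much does external knowledge reduce the
    scorer's uncertainty about the synthesized response? Larger is better."""
    h_plain = response_nll(f"Instruction: {instruction}\nResponse: ", response)
    h_cond = response_nll(
        f"Reference: {knowledge}\nInstruction: {instruction}\nResponse: ", response
    )
    return (h_plain - h_cond) / max(h_plain, 1e-8)


# Usage: re-rank synthetic samples and keep roughly the top 1% by score.
samples = [
    {"instruction": "Name the capital of France.",
     "response": "The capital of France is Paris.",
     "knowledge": "Paris is the capital and largest city of France."},
]
ranked = sorted(
    samples,
    key=lambda s: relative_predictive_entropy(s["instruction"], s["response"], s["knowledge"]),
    reverse=True,
)
budget = max(1, len(ranked) // 100)
selected = ranked[:budget]
```

A diversity-consistent step could then, for instance, greedily pick among the top-ranked samples while penalizing embedding similarity to already-selected ones, but the abstract does not specify that detail, so it is left out of the sketch above.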
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z) - Generating Synthetic Datasets for Few-shot Prompt Tuning [48.10054761841462]
In few-shot learning settings, prompt tuning lags far behind full-model fine-tuning, limiting its scope of application.
In this paper, we leverage powerful LLMs to synthesize task-specific labeled data for training the soft prompts.
We train soft prompts on both synthetic and real datasets using a gradient surgery approach.
arXiv Detail & Related papers (2024-10-08T01:00:02Z) - How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics [49.9329723199239]
We propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples.
We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics.
When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset.
arXiv Detail & Related papers (2024-10-04T13:39:21Z) - Dataset Quantization with Active Learning based Adaptive Sampling [11.157462442942775]
We show that maintaining performance is feasible even with uneven sample distributions.
We propose a novel active learning based adaptive sampling strategy to optimize the sample selection.
Our approach outperforms the state-of-the-art dataset compression methods.
arXiv Detail & Related papers (2024-07-09T23:09:18Z) - Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - One-Shot Learning as Instruction Data Prospector for Large Language Models [108.81681547472138]
Nuggets uses one-shot learning to select high-quality instruction data from extensive datasets.
We show that instruction tuning with the top 1% of examples curated by Nuggets substantially outperforms conventional methods employing the entire dataset.
arXiv Detail & Related papers (2023-12-16T03:33:12Z) - Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
We introduce a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective.
The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets.
Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
arXiv Detail & Related papers (2023-11-14T14:10:40Z) - Optimal Sample Selection Through Uncertainty Estimation and Its Application in Deep Learning [22.410220040736235]
We present a theoretically optimal solution for addressing both coreset selection and active learning.
Our proposed method, COPS, is designed to minimize the expected loss of a model trained on subsampled data.
arXiv Detail & Related papers (2023-09-05T14:06:33Z)