Your Vision-Language Model Itself Is a Strong Filter: Towards
High-Quality Instruction Tuning with Data Selection
- URL: http://arxiv.org/abs/2402.12501v1
- Date: Mon, 19 Feb 2024 20:08:48 GMT
- Title: Your Vision-Language Model Itself Is a Strong Filter: Towards
High-Quality Instruction Tuning with Data Selection
- Authors: Ruibo Chen, Yihan Wu, Lichang Chen, Guodong Liu, Qi He, Tianyi Xiong,
Chenxi Liu, Junfeng Guo, Heng Huang
- Abstract summary: We introduce a novel dataset selection method, Self-Filter, for vision-language models (VLMs).
In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM.
In the second stage, we use the trained score net to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity.
- Score: 59.11430077029321
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data selection in instruction tuning emerges as a pivotal process for
acquiring high-quality data and training instruction-following large language
models (LLMs), but it is still a new and unexplored research area for
vision-language models (VLMs). Existing data selection approaches for LLMs
either rely on a single unreliable score or use downstream tasks for
selection, which is time-consuming and can lead to over-fitting on the
chosen evaluation datasets. To address this challenge, we introduce a novel dataset
selection method, Self-Filter, that utilizes the VLM itself as a filter. This
approach is inspired by the observation that VLMs benefit from training with
the most challenging instructions. Self-Filter operates in two stages. In the
first stage, we devise a scoring network to evaluate the difficulty of training
instructions, which is co-trained with the VLM. In the second stage, we use the
trained score net to measure the difficulty of each instruction, select the
most challenging samples, and penalize similar samples to encourage diversity.
Comprehensive experiments on LLaVA and MiniGPT-4 show that Self-Filter can
reach better results than the full-data setting with merely about 15% of the
samples, and can achieve superior performance against competitive baselines.
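The second stage is concrete enough to sketch. Below is a minimal, hedged rendition of difficulty-first selection with a diversity penalty, assuming the co-trained score net has already produced per-sample difficulty scores and that sample embeddings are available; the cosine-similarity penalty and the weight `alpha` are illustrative stand-ins, not the paper's exact formulation.

```python
# Hedged sketch of Self-Filter's second stage: greedily pick the most
# difficult samples while penalizing similarity to what is already chosen.
# `scores` (from the co-trained score net) and `embs` are assumed inputs;
# `alpha` is a hypothetical trade-off weight.
import numpy as np

def select_subset(scores: np.ndarray, embs: np.ndarray,
                  k: int, alpha: float = 1.0) -> list[int]:
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    chosen: list[int] = []
    penalty = np.zeros_like(scores)  # running max similarity to the chosen set
    for _ in range(k):
        utility = scores - alpha * penalty
        utility[chosen] = -np.inf    # never pick a sample twice
        i = int(np.argmax(utility))
        chosen.append(i)
        penalty = np.maximum(penalty, embs @ embs[i])
    return chosen

# Example: keep roughly 15% of a 1,000-sample pool, as in the experiments.
rng = np.random.default_rng(0)
subset = select_subset(rng.random(1000), rng.normal(size=(1000, 64)), k=150)
```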
Related papers
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method works under black-box conditions, without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
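The entry above lends itself to a small probe. The sketch below is one plausible reading of an option-based, black-box leakage check, not the paper's exact procedure: if a model consistently prefers a question's original option ordering over random shufflings, the item may have been memorized. `logprob` stands in for whatever sequence-scoring access is available; the prompt format and win-rate statistic are assumptions.

```python
# Hypothetical option-order leakage probe (inferred from the summary above).
import random
from typing import Callable

def order_preference(question: str, options: list[str],
                     logprob: Callable[[str], float],
                     n_shuffles: int = 8, seed: int = 0) -> float:
    """Fraction of random option orderings the original ordering beats."""
    def fmt(opts: list[str]) -> str:
        lettered = (f"{chr(65 + i)}. {o}" for i, o in enumerate(opts))
        return question + "\n" + "\n".join(lettered)

    original = logprob(fmt(options))
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_shuffles):
        shuffled = options[:]
        rng.shuffle(shuffled)
        wins += original > logprob(fmt(shuffled))
    return wins / n_shuffles  # near 1.0 across many items suggests leakage
```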
- Importance Weighting Can Help Large Language Models Self-Improve [18.161376308532624]
Large language models (LLMs) have shown remarkable capability in numerous tasks and applications.
Fine-tuning LLMs using high-quality datasets under external supervision remains prohibitively expensive.
We propose a new metric called DS weight to approximate the distribution shift extent (DSE), inspired by importance weighting methods.
arXiv Detail & Related papers (2024-08-19T09:51:02Z)
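The summary gives the idea but not the formula, so the sketch below is a generic importance-weighting filter, not the paper's DS weight: estimate a density ratio between a small target set and the self-generated pool (here with kernel density estimates over embeddings, both assumptions) and keep the best-aligned samples.

```python
# Generic importance-weight filter for self-generated training data.
# The KDE estimators, embedding inputs, and keep fraction are assumptions.
import numpy as np
from sklearn.neighbors import KernelDensity

def importance_filter(gen_embs: np.ndarray, target_embs: np.ndarray,
                      keep_frac: float = 0.5, bandwidth: float = 1.0):
    p_tgt = KernelDensity(bandwidth=bandwidth).fit(target_embs)
    p_gen = KernelDensity(bandwidth=bandwidth).fit(gen_embs)
    # Log importance weight: log p_target(x) - log p_generated(x).
    log_w = p_tgt.score_samples(gen_embs) - p_gen.score_samples(gen_embs)
    cutoff = np.quantile(log_w, 1 - keep_frac)
    return np.where(log_w >= cutoff)[0]  # indices of samples to keep
```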
- LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science.
Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
arXiv Detail & Related papers (2024-07-02T22:23:40Z)
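In the spirit of LLM-Select, feature selection can be reduced to asking a model to rate each feature. The prompt wording, the 0-to-1 scale, and the `complete` callable (any text-completion API) below are assumptions, not the paper's exact protocol.

```python
# Hedged sketch of LLM-driven feature scoring: rate, parse, rank.
import re
from typing import Callable

PROMPT = ('Task: predict {target}.\n'
          'On a scale from 0 to 1, how important is the feature '
          '"{feature}" for this prediction? Answer with a number only.')

def rank_features(features: list[str], target: str,
                  complete: Callable[[str], str]) -> list[tuple[str, float]]:
    scored = []
    for name in features:
        reply = complete(PROMPT.format(target=target, feature=name))
        match = re.search(r"\d*\.?\d+", reply)  # first number in the reply
        scored.append((name, float(match.group()) if match else 0.0))
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Keep the top-k features for training, or use the ranking to decide which
# features are worth collecting at all.
```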
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
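Of the two winning samplers, Ask-LLM is the easier to sketch: a proxy model is asked whether an example would be useful training data, and the probability of an affirmative answer becomes its sampling score. The prompt text and the `yes_probability` callable below are assumptions standing in for the paper's setup.

```python
# Hedged sketch of Ask-LLM-style quality scoring for pre-training data.
from typing import Callable

ASK_PROMPT = ("###\n{example}\n###\n"
              "Does the previous text contain informative content that could "
              "help train a language model? Answer yes or no.")

def ask_llm_scores(examples: list[str],
                   yes_probability: Callable[[str], float]) -> list[float]:
    # `yes_probability` returns P("yes") for the given prompt; the highest-
    # scoring fraction of the corpus is kept for pre-training.
    return [yes_probability(ASK_PROMPT.format(example=e)) for e in examples]
```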
- A Survey on Data Selection for LLM Instruction Tuning [18.94987580516951]
We propose a new taxonomy of data selection methods and provide a detailed introduction to recent advances.
We emphasize the open challenges and present new frontiers of this task.
arXiv Detail & Related papers (2024-02-04T13:32:01Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
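The datamodel idea admits a compact sketch: treat target-task loss as an approximately linear function of which candidates were included in training, fit that function across many subsampled runs, and keep the datapoints with the most loss-reducing coefficients. Collecting the inclusion masks and losses is assumed already done, and the plain least-squares fit below simplifies the paper's estimator.

```python
# Hedged sketch of datamodel-style selection: linear surrogate from
# train-set membership to target loss, then pick the most helpful points.
import numpy as np

def datamodel_select(masks: np.ndarray, target_losses: np.ndarray,
                     k: int) -> np.ndarray:
    """masks: (runs, n_candidates) 0/1 inclusion; target_losses: (runs,)."""
    # Least-squares estimate of each datapoint's marginal effect on loss.
    w, *_ = np.linalg.lstsq(masks, target_losses, rcond=None)
    return np.argsort(w)[:k]  # most loss-reducing candidates first
```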
- GistScore: Learning Better Representations for In-Context Example Selection with Gist Bottlenecks [3.9638110494107095]
In-context Learning (ICL) is the ability of Large Language Models (LLMs) to perform new tasks when conditioned on prompts.
We propose Example Gisting, a novel approach for training example encoders through supervised fine-tuning.
We show that our fine-tuned models get state-of-the-art ICL performance with over 20% absolute gain over off-the-shelf retrievers.
arXiv Detail & Related papers (2023-11-16T06:28:05Z)
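Once an example encoder is trained, selection itself is plain retrieval. The sketch below assumes `encode` is the fine-tuned gist encoder returning a fixed-size vector and that cosine similarity is the selection rule; both are stand-ins for the paper's trained components.

```python
# Hedged sketch of gist-based in-context example retrieval.
import numpy as np
from typing import Callable

def pick_icl_examples(query: str, candidates: list[str],
                      encode: Callable[[str], np.ndarray],
                      k: int = 4) -> list[str]:
    E = np.stack([encode(c) for c in candidates])
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    q = encode(query)
    q = q / np.linalg.norm(q)
    return [candidates[i] for i in np.argsort(-(E @ q))[:k]]  # top-k by cosine
```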
- Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
We introduce a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective.
The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets.
Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
arXiv Detail & Related papers (2023-11-14T14:10:40Z)
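Since the summary stresses diversity of the chosen subsets, a standard stand-in for the sampling step is farthest-point (k-center) selection: each iteration adds the point farthest in embedding space from everything already selected. This is a generic diversity heuristic, not DiverseEvol's exact update.

```python
# Hedged sketch of a diversity-first subset-growing step (k-center greedy).
import numpy as np

def diverse_grow(embs: np.ndarray, current: list[int],
                 n_add: int) -> list[int]:
    chosen = list(current) or [0]  # seed arbitrarily if nothing chosen yet
    dist = np.linalg.norm(embs - embs[chosen[0]], axis=1)
    for j in chosen[1:]:
        dist = np.minimum(dist, np.linalg.norm(embs - embs[j], axis=1))
    for _ in range(n_add):
        i = int(np.argmax(dist))  # farthest point from the current subset
        chosen.append(i)
        dist = np.minimum(dist, np.linalg.norm(embs - embs[i], axis=1))
    return chosen
```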
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, identifies discrepancies between a model's expected responses and its intrinsic generation capability.
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
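The IFD metric compares how hard the answer is for the model with and without the instruction: the ratio of the conditioned loss to the unconditioned loss on the answer tokens. Below is a minimal sketch with Hugging Face transformers that glosses over tokenizer-boundary details; `model` and `tok` are any causal LM and its tokenizer.

```python
# Hedged sketch of the Instruction-Following Difficulty (IFD) score:
# loss(answer | instruction) / loss(answer). Values near 1 mean the
# instruction barely helps, flagging a challenging "cherry" sample.
import torch

def answer_loss(model, tok, answer: str, prefix: str = "") -> float:
    ids = tok(prefix + answer, return_tensors="pt").input_ids
    n_prefix = len(tok(prefix).input_ids) if prefix else 0
    labels = ids.clone()
    labels[:, :n_prefix] = -100  # score only the answer tokens
    with torch.no_grad():
        return model(input_ids=ids, labels=labels).loss.item()

def ifd(model, tok, instruction: str, answer: str) -> float:
    return (answer_loss(model, tok, answer, prefix=instruction)
            / answer_loss(model, tok, answer))
```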
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.