Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning
- URL: http://arxiv.org/abs/2402.00530v2
- Date: Fri, 7 Jun 2024 20:28:36 GMT
- Title: Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning
- Authors: Ming Li, Yong Zhang, Shwai He, Zhitao Li, Hongyu Zhao, Jianzong Wang, Ning Cheng, Tianyi Zhou
- Abstract summary: We study Superfiltering: Can we use a smaller and weaker model to select data for finetuning a larger and stronger model?
This enables us to use a much smaller and more efficient model to filter the instruction data used to train a larger language model.
Not only does it greatly speed up data filtering, but the LLM finetuned on the filtered data also achieves better performance on standard benchmarks.
- Score: 43.10197671420528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction tuning is critical for improving LLMs, but it usually suffers from low-quality and redundant data. Data filtering for instruction tuning has proved important for improving both the efficiency and the performance of the tuning process, but it also adds cost and computation because LLMs are involved in the filtering itself. To reduce the filtering cost, we study Superfiltering: Can we use a smaller and weaker model to select data for finetuning a larger and stronger model? Despite the performance gap between weak and strong language models, we find that they are highly consistent in how they perceive instruction difficulty, and hence in the data they select. This enables us to use a much smaller and more efficient model to filter the instruction data used to train a larger language model. Not only does it greatly speed up data filtering, but the LLM finetuned on the filtered data also achieves better performance on standard benchmarks. Extensive experiments validate the efficacy and efficiency of our approach.
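For concreteness, here is a minimal sketch of the weak-to-strong filtering loop, assuming the instruction-following-difficulty (IFD) style score from the authors' line of work: the ratio of a response's average loss conditioned on its instruction to its unconditioned loss, computed by a small scorer such as GPT-2. The prompt template, the keep ratio, and the IFD >= 1 noise cutoff below are illustrative assumptions, not the paper's exact recipe.

```python
# Weak-to-strong filtering sketch: a small model (GPT-2) scores instruction
# difficulty; only the hardest pairs are kept to finetune a larger LLM.
# Template, keep_ratio, and the IFD >= 1 cutoff are assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

@torch.no_grad()
def mean_nll(prefix: str, target: str) -> float:
    """Average per-token loss of `target`, optionally conditioned on `prefix`."""
    target_ids = tok(target, return_tensors="pt").input_ids.to(device)
    if prefix:
        prefix_ids = tok(prefix, return_tensors="pt").input_ids.to(device)
        input_ids = torch.cat([prefix_ids, target_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : prefix_ids.size(1)] = -100  # score only the response tokens
    else:
        input_ids = target_ids
        labels = target_ids.clone()
    return model(input_ids=input_ids, labels=labels).loss.item()

def ifd_score(instruction: str, response: str) -> float:
    # IFD = loss(response | instruction) / loss(response); higher = harder.
    return mean_nll(instruction + "\n", response) / mean_nll("", response)

def superfilter(dataset, keep_ratio=0.15):
    scored = [(ifd_score(ex["instruction"], ex["response"]), ex) for ex in dataset]
    scored = [s for s in scored if s[0] < 1.0]  # IFD >= 1: instruction adds no signal
    scored.sort(key=lambda s: s[0], reverse=True)
    return [ex for _, ex in scored[: int(len(scored) * keep_ratio)]]
```

Because the scorer is orders of magnitude smaller than the model being tuned, scoring the whole pool is cheap, which is where the speedup claimed above comes from.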
Related papers
- A Systematic Investigation of Distilling Large Language Models into Cross-Encoders for Passage Re-ranking [79.35822270532948]
Cross-encoders distilled from large language models (LLMs) are often more effective re-rankers than cross-encoders fine-tuned on manually labeled data.
We construct and release a new distillation dataset: Rank-DistiLLM.
arXiv Detail & Related papers (2024-05-13T16:51:53Z)
- From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search [19.070305201045954]
In text-based person search, data generation has become a prevailing practice, addressing privacy concerns and the arduous task of manual annotation.
We observe that only a subset of the data in constructed datasets plays a decisive role.
We introduce a new Filtering-WoRA paradigm, which combines a filtering algorithm that identifies this crucial data subset with a WoRA learning strategy for lightweight fine-tuning.
arXiv Detail & Related papers (2024-04-16T05:29:14Z)
- Boosting Disfluency Detection with Large Language Model as Disfluency Generator [8.836888435915077]
We propose a lightweight data augmentation approach for disfluency detection.
We leverage a large language model (LLM) to generate disfluent sentences as augmentation data.
We apply an uncertainty-aware data filtering approach to improve the quality of the generated sentences (see the sketch after this entry).
arXiv Detail & Related papers (2024-03-13T04:14:33Z)
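A rough sketch of one way such uncertainty-aware filtering could look, assuming a disfluency detector that returns per-token label probabilities; the `label_probs` hook and the entropy threshold are hypothetical stand-ins, not the paper's exact criterion.

```python
# Uncertainty-aware filtering sketch: keep an LLM-generated sentence only if
# a detector labels it with low average predictive entropy. `label_probs`
# and `max_entropy` are assumed, illustrative choices.
import math

def avg_entropy(token_label_probs):
    """Mean entropy of per-token label distributions, e.g. [p_fluent, p_disfluent]."""
    if not token_label_probs:
        return float("inf")  # treat unscoreable sentences as maximally uncertain
    total = 0.0
    for probs in token_label_probs:
        total -= sum(p * math.log(p + 1e-12) for p in probs)
    return total / len(token_label_probs)

def filter_generated(sentences, label_probs, max_entropy=0.3):
    """Keep generated sentences the detector labels with low uncertainty.

    `label_probs(sentence)` is an assumed hook returning per-token label
    distributions from a disfluency detector.
    """
    return [s for s in sentences if avg_entropy(label_probs(s)) <= max_entropy]
```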
- Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection [59.11430077029321]
We introduce a novel dataset selection method, Self-Filter, for vision-language models (VLMs).
In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM.
In the second stage, we use the trained score net to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity (see the sketch after this entry).
arXiv Detail & Related papers (2024-02-19T20:08:48Z)
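A minimal sketch of the select-and-penalize step described in this entry, assuming precomputed difficulty scores (e.g., from the trained score net) and sample embeddings; the greedy loop and the penalty weight are illustrative choices, not the paper's exact formulation.

```python
# Greedy difficulty-plus-diversity selection sketch: repeatedly take the
# hardest remaining sample and down-weight samples similar to it.
import numpy as np

def select_diverse(scores, embeddings, k, penalty=0.5):
    """Pick k hard samples, penalizing cosine similarity to earlier picks."""
    s = np.asarray(scores, dtype=float).copy()
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize
    chosen = []
    for _ in range(k):
        i = int(np.argmax(s))
        chosen.append(i)
        s[i] = -np.inf  # never pick the same sample twice
        sim = emb @ emb[i]  # cosine similarity to the latest pick
        s -= penalty * np.clip(sim, 0.0, None)  # down-weight near-duplicates
    return chosen
```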
- Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning [39.73918872205541]
Many recent methods focus on improving the data quality but often overlook the compatibility of the data with the student model being finetuned.
This paper introduces Selective Reflection-Tuning, a novel paradigm that synergizes a teacher LLM's reflection and introspection to improve the quality of existing data.
This teacher-student collaboration produces high-quality and student-compatible instruction-response pairs, resulting in sample-efficient instruction tuning.
arXiv Detail & Related papers (2024-02-15T17:06:21Z)
- Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization [71.87335804334616]
Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data.
The training process of Large Language Models (LLMs) generally requires updating a significant number of parameters.
This paper proposes an efficient partial prompt tuning approach to improve performance and efficiency simultaneously.
arXiv Detail & Related papers (2023-10-23T16:37:59Z)
- Scaling Relationship on Learning Mathematical Reasoning with Large Language Models [75.29595679428105]
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM.
We find that rejection sampling with multiple models pushes LLaMA-7B to an accuracy of 49.3% on GSM8K, significantly outperforming the supervised fine-tuning (SFT) accuracy of 35.9% (data collection sketched after this entry).
arXiv Detail & Related papers (2023-08-03T15:34:01Z)
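A minimal sketch of the rejection-sampling data collection behind that result, where `generate` and `extract_answer` are hypothetical helpers supplied by the caller: sample several reasoning paths per problem, keep only those whose final answer matches the reference, and de-duplicate before SFT.

```python
# Rejection-sampling fine-tuning (RFT) data collection sketch. `generate`
# and `extract_answer` are assumed helpers, not a real library API.

def collect_rft_data(models, problems, generate, extract_answer, n=8):
    """problems: dicts with "question" and gold "answer" fields."""
    kept = []
    for prob in problems:
        seen = set()
        for model in models:  # pooling several models increases path diversity
            for path in generate(model, prob["question"], n):
                ok = extract_answer(path) == prob["answer"]
                if ok and path not in seen:  # reject wrong/duplicate rationales
                    seen.add(path)
                    kept.append({"question": prob["question"], "response": path})
    return kept  # finetune the target model on this set with standard SFT
```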
- An Empirical Exploration in Quality Filtering of Text Data [0.0]
We find that aggressive filtering can lead to a decrease in model quality on a wide array of downstream tasks for a GPT-like language model.
We speculate that this is because optimizing sufficiently strongly for a proxy metric harms performance on the true objective.
arXiv Detail & Related papers (2021-09-02T04:02:51Z)
- Adversarial Filters of Dataset Biases [96.090959788952]
Large neural models have demonstrated human-level performance on language and vision benchmarks.
However, their performance degrades considerably on adversarial or out-of-distribution samples.
We propose AFLite, which adversarially filters out such dataset biases (see the sketch after this entry).
arXiv Detail & Related papers (2020-02-10T21:59:21Z)
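A minimal sketch of AFLite-style filtering, assuming precomputed feature embeddings X and labels y; the number of partitions, the train fraction, and the predictability cutoff are illustrative parameters rather than the paper's exact settings.

```python
# AFLite-style adversarial filtering sketch: train cheap linear probes on
# random splits of precomputed embeddings, estimate how predictable each
# instance's label is, and discard the most predictable (bias-revealing) ones.
import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite(X, y, target_size, n_partitions=32, train_frac=0.8, cut=0.75):
    """X: (n, d) precomputed embeddings; y: (n,) labels as numpy arrays."""
    idx = np.arange(len(y))
    while len(idx) > target_size:
        correct = np.zeros(len(idx))
        counts = np.zeros(len(idx))
        for _ in range(n_partitions):
            perm = np.random.permutation(len(idx))
            split = int(train_frac * len(idx))
            tr, te = perm[:split], perm[split:]
            clf = LogisticRegression(max_iter=1000).fit(X[idx[tr]], y[idx[tr]])
            correct[te] += clf.predict(X[idx[te]]) == y[idx[te]]
            counts[te] += 1
        predictability = correct / np.maximum(counts, 1)
        keep = predictability < cut  # drop instances linear probes find easy
        if keep.all():
            break  # no instance is predictable enough to remove
        idx = idx[keep]
    return idx  # indices of the retained, less biased subset
```

Using frozen embeddings plus linear probes keeps each filtering round cheap, so the random-partition loop can be repeated many times for a stable predictability estimate.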
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.