Boosting LLM via Learning from Data Iteratively and Selectively
- URL: http://arxiv.org/abs/2412.17365v1
- Date: Mon, 23 Dec 2024 08:01:24 GMT
- Title: Boosting LLM via Learning from Data Iteratively and Selectively
- Authors: Qi Jia, Siyu Ren, Ziheng Qin, Fuzhao Xue, Jinjie Ni, Yang You,
- Abstract summary: We measure the quality of a sample from complexity and diversity simultaneously.
IterIT integrates the strengths of both worlds by iteratively updating the complexity score for the top-ranked samples.
- Score: 23.30222913893128
- Abstract: Datasets nowadays are generally constructed from multiple sources and using different synthetic techniques, making data de-noising and de-duplication crucial before they are used for post-training. In this work, we propose to perform instruction tuning by iterative data selection (IterIT). We measure the quality of a sample from complexity and diversity simultaneously. Instead of calculating the complexity score once and for all before fine-tuning, we highlight the importance of updating this model-specific score during fine-tuning to accurately accommodate the dynamic changes of the model. On the other hand, the diversity score is defined on top of the samples' responses under the consideration of their informativeness. IterIT integrates the strengths of both worlds by iteratively updating the complexity score for the top-ranked samples and greedily selecting the ones with the highest complexity-diversity score. Experiments on multiple instruction-tuning datasets demonstrate consistent improvements of IterIT over strong baselines. Moreover, our approach also generalizes well to domain-specific scenarios and different backbone models. All resources will be available at https://github.com/JiaQiSJTU/IterIT.
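The selection loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define the two scores, so everything here is an assumption made for illustration. The complexity score is treated as a number returned by a hypothetical `rescore` callback (standing in for a score computed with the current model), the diversity score is approximated by a hypothetical `novelty` function (the fraction of response tokens not yet covered by selected samples), and the two are combined by multiplication.

```python
from typing import Callable, Dict, List, Set


def novelty(response: str, seen: Set[str]) -> float:
    """Assumed diversity proxy: share of distinct response tokens not yet selected."""
    tokens = set(response.lower().split())
    if not tokens:
        return 0.0
    return len(tokens - seen) / len(tokens)


def select_for_epoch(
    samples: List[Dict[str, str]],
    complexity: List[float],
    rescore: Callable[[Dict[str, str]], float],
    budget: int,
    refresh_top: int,
) -> List[int]:
    """Refresh complexity for the top-ranked samples, then greedily pick `budget` indices."""
    # Re-score only the currently top-ranked samples (updates `complexity` in place).
    ranked = sorted(range(len(samples)), key=lambda i: complexity[i], reverse=True)
    for i in ranked[:refresh_top]:
        complexity[i] = rescore(samples[i])

    selected: List[int] = []
    seen: Set[str] = set()
    remaining = set(range(len(samples)))
    while remaining and len(selected) < budget:
        # Highest complexity * novelty wins; ties go to the lower index.
        best = max(
            sorted(remaining),
            key=lambda i: complexity[i] * novelty(samples[i]["response"], seen),
        )
        selected.append(best)
        remaining.remove(best)
        seen |= set(samples[best]["response"].lower().split())
    return selected


if __name__ == "__main__":
    data = [
        {"instruction": "a", "response": "the cat sat"},
        {"instruction": "b", "response": "the cat sat"},
        {"instruction": "c", "response": "gradient descent converges slowly"},
    ]
    scores = [3.0, 2.9, 1.0]
    # A constant stub stands in for a model-based complexity score.
    print(select_for_epoch(data, scores, lambda s: 2.0, budget=2, refresh_top=1))
```

In this toy run, once one of the two identical responses has been picked, the other has zero novelty and is passed over in favor of the response with unseen tokens. In a fine-tuning loop, a function like `select_for_epoch` would be called once per iteration so that the refreshed scores reflect the model's current state.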
Related papers
- Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning.
We construct pseudo-skill clusters by grouping gradient-based sample vectors.
We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
- Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement [8.509688686402438]
Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities.
This work addresses the question: How can we determine the optimal subset of data for effective training?
Our method employs k-means clustering to ensure the selected subset effectively represents the full dataset.
arXiv Detail & Related papers (2024-09-17T17:25:31Z)
- A General Model for Aggregating Annotations Across Simple, Complex, and Multi-Object Annotation Tasks [51.14185612418977]
A strategy to improve label quality is to ask multiple annotators to label the same item and aggregate their labels.
While a variety of bespoke models have been proposed for specific tasks, our work is the first to introduce aggregation methods that generalize across many diverse complex tasks.
This article extends our prior work with investigation of three new research questions.
arXiv Detail & Related papers (2023-12-20T21:28:35Z)
- Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
We introduce a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective.
The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets.
Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
arXiv Detail & Related papers (2023-11-14T14:10:40Z)
- Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
- Single-dataset Experts for Multi-dataset Question Answering [6.092171111087768]
We train a network on multiple datasets to generalize and transfer better to new datasets.
Our approach is to model multi-dataset question answering with a collection of single-dataset experts.
Simple methods based on parameter-averaging lead to better zero-shot generalization and few-shot transfer performance.
arXiv Detail & Related papers (2021-09-28T17:08:22Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- Improving QA Generalization by Concurrent Modeling of Multiple Biases [61.597362592536896]
Existing NLP datasets contain various biases that models can easily exploit to achieve high performances on the corresponding evaluation sets.
We propose a general framework for improving the performance on both in-domain and out-of-domain datasets by concurrent modeling of multiple biases in the training data.
We extensively evaluate our framework on extractive question answering with training data from various domains with multiple biases of different strengths.
arXiv Detail & Related papers (2020-10-07T11:18:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.