Smaller Language Models are capable of selecting Instruction-Tuning
Training Data for Larger Language Models
- URL: http://arxiv.org/abs/2402.10430v1
- Date: Fri, 16 Feb 2024 03:39:37 GMT
- Title: Smaller Language Models are capable of selecting Instruction-Tuning
Training Data for Larger Language Models
- Authors: Dheeraj Mekala, Alex Nguyen, Jingbo Shang
- Abstract summary: We introduce a novel training data selection method based on the learning percentage of the samples.
We assert that current language models possess the capability to autonomously select high-quality training data.
Our paper introduces a novel approach to training data selection, showcasing a more efficient alternative.
- Score: 39.65879784788677
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instruction-tuning language models has become a crucial step in aligning them
for general use. Typically, this process involves extensive training on large
datasets, incurring high training costs. In this paper, we introduce a novel
training data selection method based on the learning percentage of the samples. We
assert that current language models possess the capability to autonomously
select high-quality training data, leading to comparable or improved
performance compared to training on the entire dataset. Our experiments span
different-sized models, revealing that this characteristic holds for models
ranging from 1B (small) to 13B (large) in size. Moreover, we demonstrate an
interesting finding that the data hardness transfers across model sizes, and a
smaller 350M model can effectively curate high-quality training data with hard
samples for a larger 13B model, resulting in an instruction-tuned model that is
equal or superior to one trained on the complete dataset. Using open-sourced
OPT and Llama-2 models up to 13B in size and two publicly available
instruction-tuning training datasets, and evaluating with both automatic
metrics and human judgments, our paper introduces a novel approach to training
data selection, showcasing a more efficient alternative.
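The abstract does not spell out how the learning percentage is computed, so the following is only a minimal sketch of one plausible reading: record each sample's loss on a small proxy model at a few early checkpoints, treat samples whose loss falls slowly (a low learning percentage) as hard, and keep those for instruction-tuning the larger model. The function names, the formula, and the keep ratio are illustrative assumptions, not the paper's exact procedure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    text: str
    loss_per_epoch: List[float]  # per-sample loss logged after each early epoch on a proxy model


def learning_percentage(sample: Sample) -> float:
    """Fraction of the total loss drop achieved by the first epoch.

    This is an assumed reading of the paper's 'learning percentage';
    the abstract itself does not give the exact formula.
    """
    initial, first, final = (
        sample.loss_per_epoch[0],
        sample.loss_per_epoch[1],
        sample.loss_per_epoch[-1],
    )
    total_drop = initial - final
    if total_drop <= 0:
        return 1.0  # nothing left to learn -> treat as fully learned (easy)
    return (initial - first) / total_drop


def select_hard_samples(samples: List[Sample], keep_ratio: float = 0.5) -> List[Sample]:
    """Keep the samples the proxy model learned most slowly (lowest learning percentage)."""
    ranked = sorted(samples, key=learning_percentage)
    k = max(1, int(len(ranked) * keep_ratio))
    return ranked[:k]


if __name__ == "__main__":
    data = [
        Sample("easy instruction", [2.0, 0.4, 0.3]),  # learned almost immediately
        Sample("hard instruction", [2.0, 1.8, 0.9]),  # learned slowly
    ]
    for s in select_hard_samples(data, keep_ratio=0.5):
        print(s.text)  # prints "hard instruction"
```

In practice the per-epoch losses would come from briefly fine-tuning the small proxy model (e.g., a 350M checkpoint) on the full dataset and logging per-sample losses; here they are supplied directly so the selection logic stays self-contained.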
Related papers
- What Should Baby Models Read? Exploring Sample-Efficient Data Composition on Model Performance [0.0]
We evaluate several dataset sources, including child-directed speech (CHILDES), classic books (Gutenberg), synthetic data (TinyStories) and a mix of these across different model sizes.
Our experiments show that smaller models (e.g., GPT2-97M, GPT2-705M, Llama-360M) perform better when trained on more complex and rich datasets like Gutenberg.
arXiv Detail & Related papers (2024-11-11T02:37:21Z)
- QuRating: Selecting High-Quality Data for Training Language Models [64.83332850645074]
We introduce QuRating, a method for selecting pre-training data that can capture human intuitions about data quality.
In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value.
We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B training corpus with quality ratings for each of the four criteria (a minimal pairwise-to-scalar sketch appears after this list).
arXiv Detail & Related papers (2024-02-15T06:36:07Z)
- D4: Improving LLM Pretraining via Document De-Duplication and Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training.
We also show that repeating data intelligently consistently outperforms baseline training.
arXiv Detail & Related papers (2023-08-23T17:58:14Z)
- Joint Adaptive Representations for Image-Language Learning [59.40890927221377]
We propose a recipe for image-language learning, which produces effective models, outperforming bigger and more expensive ones, often trained on orders of magnitude larger datasets.
Our key finding is the joint learning of a compact vision and language representation, which adaptively and iteratively fuses the multi-modal features.
With only 40M training examples and 39 GFLOPs, our lightweight model outperforms much larger state-of-the-art models that use 2-20x more FLOPs and bigger datasets, some with close to 1B training examples.
arXiv Detail & Related papers (2023-05-31T15:02:02Z)
- TRAK: Attributing Model Behavior at Scale [79.56020040993947]
We present TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale, differentiable models.
arXiv Detail & Related papers (2023-03-24T17:56:22Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space (a simple parameter-averaging sketch appears after this list).
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Revealing Secrets From Pre-trained Models [2.0249686991196123]
Transfer-learning has been widely adopted in many emerging deep learning algorithms.
We show that pre-trained models and fine-tuned models have significantly high similarities in weight values.
We propose a new model extraction attack that reveals the model architecture and the pre-trained model used by the black-box victim model.
arXiv Detail & Related papers (2022-07-19T20:19:03Z)
- Data Selection Curriculum for Neural Machine Translation [31.55953464971441]
We introduce a two-stage curriculum training framework for NMT models.
We fine-tune a base NMT model on subsets of data, selected by both deterministic scoring using pre-trained language models and online scoring.
We have shown that our curriculum strategies consistently demonstrate better quality (up to +2.2 BLEU improvement) and faster convergence.
arXiv Detail & Related papers (2022-03-25T19:08:30Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of almost half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- It's the Best Only When It Fits You Most: Finding Related Models for Serving Based on Dynamic Locality Sensitive Hashing [1.581913948762905]
Preparation of training data is often a bottleneck in the lifecycle of deploying a deep learning model for production or research.
This paper proposes an end-to-end process of searching related models for serving based on the similarity of the target dataset and the training datasets of the available models.
arXiv Detail & Related papers (2020-10-13T22:52:13Z)
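For the QuRating entry above (referenced there), here is a minimal sketch of the generic pairwise-to-scalar setup it describes: a small rater scores two inputs and is trained with a Bradley-Terry-style logistic loss so that the preferred input receives the higher scalar score. The toy feature vectors, the TinyRater module, and the training loop are illustrative assumptions and do not reflect the actual QuRater architecture or data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyRater(nn.Module):
    """Toy stand-in for a quality rater: maps a feature vector to one scalar score."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats).squeeze(-1)


def pairwise_loss(score_preferred: torch.Tensor, score_other: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry / logistic pairwise objective: push the preferred score above the other.
    return -F.logsigmoid(score_preferred - score_other).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    rater = TinyRater(dim=8)
    opt = torch.optim.Adam(rater.parameters(), lr=1e-2)

    # Toy "documents": random feature vectors where a higher mean marks the preferred side.
    good = torch.randn(64, 8) + 1.0
    bad = torch.randn(64, 8) - 1.0

    for _ in range(200):
        opt.zero_grad()
        loss = pairwise_loss(rater(good), rater(bad))
        loss.backward()
        opt.step()

    # After training, the scalar scores can annotate and rank a corpus by quality.
    print(rater(good).mean().item() > rater(bad).mean().item())  # expected: True
```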
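For the dataless knowledge fusion entry above (referenced there), the simplest form of merging models in parameter space is plain weight averaging; the sketch below shows only that baseline, under the assumption of identical architectures, and is not that paper's actual merging algorithm.

```python
import copy
from typing import List

import torch
import torch.nn as nn


def average_merge(models: List[nn.Module]) -> nn.Module:
    """Merge same-architecture models by averaging their parameters (a simple baseline)."""
    merged = copy.deepcopy(models[0])
    state_dicts = [m.state_dict() for m in models]
    avg_state = {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }
    merged.load_state_dict(avg_state)
    return merged


if __name__ == "__main__":
    torch.manual_seed(0)
    checkpoints = [nn.Linear(4, 2) for _ in range(3)]  # stand-ins for fine-tuned models
    merged = average_merge(checkpoints)
    print(merged.weight.shape)  # torch.Size([2, 4])
```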
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.