Skill-it! A Data-Driven Skills Framework for Understanding and Training
Language Models
- URL: http://arxiv.org/abs/2307.14430v1
- Date: Wed, 26 Jul 2023 18:01:49 GMT
- Title: Skill-it! A Data-Driven Skills Framework for Understanding and Training
Language Models
- Authors: Mayee F. Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang,
Frederic Sala, Christopher Ré
- Abstract summary: We study how to best select data that leads to good downstream model performance across tasks.
We develop a new framework based on a simple hypothesis: just as humans acquire interdependent skills in a deliberate order, language models also follow a natural order when learning a set of skills from their training data.
- Score: 29.17711426767209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quality of training data impacts the performance of pre-trained large
language models (LMs). Given a fixed budget of tokens, we study how to best
select data that leads to good downstream model performance across tasks. We
develop a new framework based on a simple hypothesis: just as humans acquire
interdependent skills in a deliberate order, language models also follow a
natural order when learning a set of skills from their training data. If such
an order exists, it can be utilized for improved understanding of LMs and for
data-efficient training. Using this intuition, our framework formalizes the
notion of a skill and of an ordered set of skills in terms of the associated
data. First, using both synthetic and real data, we demonstrate that these
ordered skill sets exist, and that their existence enables more advanced skills
to be learned with less data when we train on their prerequisite skills.
Second, using our proposed framework, we introduce an online data sampling
algorithm, Skill-It, over mixtures of skills for both continual pre-training
and fine-tuning regimes, where the objective is to efficiently learn multiple
skills in the former and an individual skill in the latter. On the LEGO
synthetic dataset in the continual pre-training setting, Skill-It obtains 36.5 points
higher accuracy than random sampling. On the Natural Instructions dataset in
the fine-tuning setting, Skill-It reduces the validation loss on the target
skill by 13.6% versus training on data associated with the target skill itself.
We apply our skills framework on the recent RedPajama dataset to continually
pre-train a 3B-parameter LM, achieving higher accuracy on the LM Evaluation
Harness with 1B tokens than the baseline approach of sampling uniformly over
data sources with 3B tokens.
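
The abstract describes Skill-It only at a high level. As a rough illustration, the sketch below shows an online sampler that reweights a mixture over skills using a given skills graph and per-skill validation losses; the multiplicative-weights update, the step size eta, and the eval_loss_fn interface are simplifying assumptions for exposition, not the paper's exact algorithm.

```python
import numpy as np

def skill_it_mixture_sketch(skills_graph, eval_loss_fn, num_rounds, eta=0.5):
    """Illustrative online sampler over skill mixtures, in the spirit of Skill-It.

    skills_graph : (k, k) array A, where A[i, j] > 0 means training on skill i
        is believed to reduce the validation loss on skill j (assumed given here;
        the paper also describes how to estimate it).
    eval_loss_fn : callable(mixture) -> (k,) array of per-skill validation losses
        after one round of training on data sampled according to `mixture`; it
        stands in for the real train-then-evaluate step.
    Returns the list of sampling mixtures used at each round.
    """
    k = skills_graph.shape[0]
    weights = np.ones(k)                   # start from the uniform mixture
    history = []
    for _ in range(num_rounds):
        mixture = weights / weights.sum()  # current distribution over skills
        history.append(mixture)
        losses = eval_loss_fn(mixture)     # per-skill validation loss this round
        # Multiplicative-weights-style update (a simplifying assumption, not the
        # paper's exact rule): upweight skills whose data the graph predicts will
        # reduce the largest remaining losses.
        weights = weights * np.exp(eta * (skills_graph @ losses))
        weights = weights / weights.sum()  # renormalize for numerical stability
    return history

# Toy usage: skill 0 is a prerequisite of skill 1 and skill 1 has the larger loss,
# so sampling mass gradually shifts toward skill 0's data.
A = np.array([[1.0, 0.8],
              [0.0, 1.0]])
toy_losses = lambda mixture: np.array([0.5, 2.0]) * (1.0 - 0.4 * mixture)
print(skill_it_mixture_sketch(A, toy_losses, num_rounds=5)[-1])
```

Renormalizing each round keeps the weights bounded and makes the mixture directly usable as sampling probabilities for the next batch.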
Related papers
- QuRating: Selecting High-Quality Data for Training Language Models [64.83332850645074]
We introduce QuRating, a method for selecting pre-training data that can capture human intuitions about data quality.
In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value.
We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B-token training corpus with quality ratings for each of the four criteria (a generic pairwise-rating sketch appears after this list).
arXiv Detail & Related papers (2024-02-15T06:36:07Z)
- JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching [18.94748873243611]
JobSkape is a framework to generate synthetic data for skill-to-taxonomy matching.
Within this framework, we create SkillSkape, a comprehensive open-source synthetic dataset of job postings.
We present a multi-step pipeline for skill extraction and matching tasks using large language models.
arXiv Detail & Related papers (2024-02-05T17:57:26Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Entailment as Robust Self-Learner [14.86757876218415]
We design a prompting strategy that formulates a number of different NLU tasks as contextual entailment.
We propose the Simple Pseudo-Label Editing (SimPLE) algorithm for better pseudo-labeling quality in self-training.
arXiv Detail & Related papers (2023-05-26T18:41:23Z)
- Design of Negative Sampling Strategies for Distantly Supervised Skill Extraction [19.43668931500507]
We propose an end-to-end system for skill extraction, based on distant supervision through literal matching.
We observe that using the ESCO taxonomy to select negative examples from related skills yields the biggest improvements.
We release the benchmark dataset for research purposes to stimulate further research on the task.
arXiv Detail & Related papers (2022-09-13T13:37:06Z)
- Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably with supervised pre-training counterparts on 3 downstream tasks and 9 downstream datasets, while requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z)
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
- On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
- BiFair: Training Fair Models with Bilevel Optimization [8.2509884277533]
We develop a new training algorithm, named BiFair, which jointly minimizes a utility loss and a fairness loss of interest.
Our algorithm consistently performs better, i.e., it reaches better values of a given fairness metric at the same or higher accuracy.
arXiv Detail & Related papers (2021-06-03T22:36:17Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To keep training on the resulting enlarged dataset efficient, we propose to apply a dataset distillation strategy that compresses it into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
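
The QuRating entry above mentions training a rater to turn pairwise quality judgments into scalar ratings. The sketch below shows a generic Bradley-Terry-style objective commonly used for that step; the rater architecture, the embedding inputs, and the training loop are illustrative assumptions rather than QuRating's actual implementation.

```python
import torch
import torch.nn as nn

class PairwiseRaterSketch(nn.Module):
    """Toy scalar-rating head trained from pairwise judgments (a generic
    Bradley-Terry-style setup; not QuRating's actual model)."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)  # maps a text embedding to one scalar rating

    def forward(self, text_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(text_embedding).squeeze(-1)

def pairwise_loss(model: nn.Module, preferred: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
    # Model P(preferred beats other) = sigmoid(score_preferred - score_other);
    # the negative log-likelihood is a softplus on the negated score gap.
    gap = model(preferred) - model(other)
    return nn.functional.softplus(-gap).mean()

# Toy usage: random vectors stand in for embeddings of the two texts in each judged pair.
model = PairwiseRaterSketch(embed_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
preferred, other = torch.randn(8, 16), torch.randn(8, 16)
loss = pairwise_loss(model, preferred, other)
loss.backward()
optimizer.step()
```

Once trained on many such pairs, the scalar head can score every document in a corpus, which is how pairwise judgments are turned into per-document quality ratings.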
This list is automatically generated from the titles and abstracts of the papers on this site.