Skill-it! A Data-Driven Skills Framework for Understanding and Training
Language Models
- URL: http://arxiv.org/abs/2307.14430v1
- Date: Wed, 26 Jul 2023 18:01:49 GMT
- Title: Skill-it! A Data-Driven Skills Framework for Understanding and Training
Language Models
- Authors: Mayee F. Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang,
Frederic Sala, Christopher Ré
- Abstract summary: We study how to best select data that leads to good downstream model performance across tasks.
We develop a new framework based on a simple hypothesis: just as humans acquire interdependent skills in a deliberate order, language models also follow a natural order when learning a set of skills from their training data.
- Score: 29.17711426767209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quality of training data impacts the performance of pre-trained large
language models (LMs). Given a fixed budget of tokens, we study how to best
select data that leads to good downstream model performance across tasks. We
develop a new framework based on a simple hypothesis: just as humans acquire
interdependent skills in a deliberate order, language models also follow a
natural order when learning a set of skills from their training data. If such
an order exists, it can be utilized for improved understanding of LMs and for
data-efficient training. Using this intuition, our framework formalizes the
notion of a skill and of an ordered set of skills in terms of the associated
data. First, using both synthetic and real data, we demonstrate that these
ordered skill sets exist, and that their existence enables more advanced skills
to be learned with less data when we train on their prerequisite skills.
Second, using our proposed framework, we introduce an online data sampling
algorithm, Skill-It, over mixtures of skills for both continual pre-training
and fine-tuning regimes, where the objective is to efficiently learn multiple
skills in the former and an individual skill in the latter. On the LEGO
synthetic dataset in the continual pre-training setting, Skill-It obtains 36.5 points
higher accuracy than random sampling. On the Natural Instructions dataset in
the fine-tuning setting, Skill-It reduces the validation loss on the target
skill by 13.6% versus training on data associated with the target skill itself.
We apply our skills framework to the recent RedPajama dataset to continually
pre-train a 3B-parameter LM, achieving higher accuracy on the LM Evaluation
Harness with 1B tokens than the baseline approach of sampling uniformly over
data sources with 3B tokens.
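
The abstract describes Skill-It only at a high level: an online algorithm that re-weights a sampling mixture over skills as training progresses. The paper's exact update rule is not reproduced above, so the following Python sketch shows one plausible multiplicative-weights style realization; the skill-dependency matrix `A`, the step size `eta`, and the `train_step`/`eval_losses` hooks are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def skill_it_update(A, losses, w, eta=0.5):
    """One online re-weighting of the sampling mixture over k skills.

    A:      (k, k) skill-dependency matrix; A[i, j] > 0 means training on
            skill i helps skill j (self-influence on the diagonal).
    losses: (k,) current validation loss for each evaluation skill.
    w:      (k,) current mixture weights, summing to 1.
    """
    scores = A @ losses               # how much each skill's data could still help
    w_new = w * np.exp(eta * scores)  # upweight skills with high-loss dependents
    return w_new / w_new.sum()

def sample_batch(datasets, w, batch_size, rng):
    """Draw a training batch whose per-skill proportions follow w."""
    counts = rng.multinomial(batch_size, w)
    batch = []
    for dataset, n in zip(datasets, counts):
        idx = rng.choice(len(dataset), size=n, replace=False)
        batch.extend(dataset[i] for i in idx)
    return batch

def train_with_skill_mixture(model, datasets, A, train_step, eval_losses,
                             num_rounds=10, batch_size=256, eta=0.5):
    """Online loop: sample by mixture, train, then re-weight the skills."""
    rng = np.random.default_rng(0)
    w = np.full(len(datasets), 1.0 / len(datasets))
    for _ in range(num_rounds):
        batch = sample_batch(datasets, w, batch_size, rng)
        train_step(model, batch)      # caller-supplied trainer hook
        w = skill_it_update(A, eval_losses(model), w, eta)
    return w
```

Under this sketch, the fine-tuning regime corresponds to scoring against a single target skill's loss rather than all k, so prerequisite skills are upweighted only insofar as the graph says they reduce the target's loss.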
Related papers
- Predicting Large Language Model Capabilities on Closed-Book QA Tasks Using Only Information Available Prior to Training [51.60874286674908]
We focus on predicting performance on Closed-book Question Answering (CBQA) tasks, which are closely tied to pre-training data and knowledge retention.
We address three major challenges: 1) mastering the entire pre-training process, especially data construction; 2) evaluating a model's knowledge retention; and 3) predicting task-specific knowledge retention using only information available prior to training.
We introduce the SMI metric, an information-theoretic measure that quantifies the relationship between pre-training data, model size, and task-specific knowledge retention.
arXiv Detail & Related papers (2025-02-06T13:23:53Z)
- Dynamic Skill Adaptation for Large Language Models [78.31322532135272]
We present Dynamic Skill Adaptation (DSA), an adaptive and dynamic framework for teaching novel and complex skills to Large Language Models (LLMs).
For every skill, we use LLMs to generate both textbook-like data, which contains detailed descriptions of the skill for pre-training, and exercise-like data, which explicitly applies the skill to solve problems for instruction tuning.
Experiments on large language models such as LLaMA and Mistral demonstrate the effectiveness of our proposed methods in adapting math reasoning skills and social studies skills.
arXiv Detail & Related papers (2024-12-26T22:04:23Z)
- QuRating: Selecting High-Quality Data for Training Language Models [64.83332850645074]
We introduce QuRating, a method for selecting pre-training data that can capture human intuitions about data quality.
In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value.
We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B-token training corpus with quality ratings for each of the four criteria.
arXiv Detail & Related papers (2024-02-15T06:36:07Z)
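
The QuRater model in the entry above learns scalar quality ratings from pairwise judgments. This summary does not give the training objective, so the PyTorch sketch below uses a standard Bradley-Terry style loss, which is the usual way to fit scalars to pairwise preferences; the `QualityRater` module and its input embeddings are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

class QualityRater(nn.Module):
    """Maps a document embedding to one scalar rating per quality criterion."""
    def __init__(self, embed_dim, num_criteria=4):
        super().__init__()
        self.head = nn.Linear(embed_dim, num_criteria)

    def forward(self, doc_embedding):
        return self.head(doc_embedding)   # (batch, num_criteria)

def pairwise_rating_loss(rater, emb_a, emb_b, prefer_a):
    """Bradley-Terry loss: the preferred document should score higher.

    prefer_a: (batch, num_criteria) float tensor of 0/1 labels,
              1.0 where document A was judged higher quality than B.
    """
    margin = rater(emb_a) - rater(emb_b)  # score difference per criterion
    return nn.functional.binary_cross_entropy_with_logits(margin, prefer_a)
```

Once fitted, a rater like this scores each document in a single forward pass, so annotating a large corpus avoids a quadratic number of pairwise comparisons.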
- JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching [18.94748873243611]
JobSkape is a framework to generate synthetic data for skill-to-taxonomy matching.
Within this framework, we create SkillSkape, a comprehensive open-source synthetic dataset of job postings.
We present a multi-step pipeline for skill extraction and matching tasks using large language models.
arXiv Detail & Related papers (2024-02-05T17:57:26Z)
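
The JobSkape entry above mentions a multi-step LLM pipeline for skill extraction and matching but gives no implementation details. The sketch below only illustrates the extract-then-match shape; `llm_complete` is a placeholder for any text-completion call, and both prompts are invented for illustration.

```python
from typing import Callable

def extract_skills(posting: str, llm_complete: Callable[[str], str]) -> list[str]:
    """Step 1: ask an LLM to list the raw skill mentions in a posting."""
    prompt = ("List the skills mentioned in this job posting, "
              f"one per line:\n{posting}")
    return [line.strip() for line in llm_complete(prompt).splitlines()
            if line.strip()]

def match_to_taxonomy(mention: str, taxonomy: list[str],
                      llm_complete: Callable[[str], str]) -> str:
    """Step 2: map a raw mention onto the closest taxonomy entry."""
    options = "\n".join(taxonomy)
    prompt = (f"Which taxonomy skill best matches '{mention}'? "
              f"Answer with exactly one line from:\n{options}")
    return llm_complete(prompt).strip()
```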
- Design of Negative Sampling Strategies for Distantly Supervised Skill Extraction [19.43668931500507]
We propose an end-to-end system for skill extraction, based on distant supervision through literal matching.
We observe that using the ESCO taxonomy to select negative examples from related skills yields the biggest improvements.
We release the benchmark dataset for research purposes to stimulate further research on the task.
arXiv Detail & Related papers (2022-09-13T13:37:06Z)
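
The entry above finds that drawing negative examples from related skills in the ESCO taxonomy yields the biggest improvements. The exact selection rule is not given in this summary; the sketch below shows the basic idea under the assumption that the taxonomy is available as a mapping from each skill to its neighbours (e.g. siblings under the same ESCO group).

```python
import random

def sample_negatives(positive_skill, related, all_skills, k=5, rng=random):
    """Pick k negative skills, preferring ones related to the positive.

    related:    dict mapping each skill to its taxonomy neighbours.
    all_skills: fallback pool when a skill has few neighbours.
    """
    pool = [s for s in related.get(positive_skill, []) if s != positive_skill]
    if len(pool) < k:  # top up with random skills from the full taxonomy
        extra = [s for s in all_skills if s != positive_skill and s not in pool]
        pool += rng.sample(extra, k - len(pool))
    return rng.sample(pool, k)
```

The intuition is that taxonomy neighbours are hard negatives: they are close enough to the positive skill that the model must learn a fine-grained boundary rather than a trivial one.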
- Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably with supervised pre-training counterparts on 3 downstream tasks and 9 downstream datasets, while requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z)
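
The distillation entry above transfers a pre-trained teacher's feature representation to a new student. The summary does not specify the loss, so the PyTorch sketch below uses plain feature matching with an L2 objective and a learned projection to align dimensions; these are common choices, not necessarily the paper's.

```python
import torch
import torch.nn as nn

def feature_distillation_loss(student_feat, teacher_feat, proj):
    """Match (projected) student features to frozen teacher features."""
    return nn.functional.mse_loss(proj(student_feat), teacher_feat)

def distill_step(student, teacher, proj, images, optimizer):
    """One training step: the teacher is frozen, only the student and
    the projection head receive gradients."""
    with torch.no_grad():
        t_feat = teacher(images)
    loss = feature_distillation_loss(student(images), t_feat, proj)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```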
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
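
The NDV entry above recasts distinct-value estimation as supervised learning but does not say what the model consumes. A standard sufficient statistic for this problem is the sample's frequency profile (how many values occur exactly once, twice, and so on), so the sketch below builds that feature and fits an assumed regressor on columns whose true NDV is known.

```python
from collections import Counter
import numpy as np

def frequency_profile(sample, max_freq=10):
    """Feature vector f where f[j] = number of values appearing exactly
    j+1 times in the sample (frequencies above max_freq pooled into the
    last bin)."""
    counts = Counter(Counter(sample).values())
    f = np.zeros(max_freq)
    for freq, n_values in counts.items():
        f[min(freq, max_freq) - 1] += n_values
    return f

def fit_ndv_estimator(regressor, samples, true_ndvs):
    """Train on random samples from columns with known NDV. `regressor`
    can be any supervised model (e.g. gradient-boosted trees); predicting
    log(NDV) keeps targets on a comparable scale across columns."""
    X = np.stack([frequency_profile(s) for s in samples])
    y = np.log(np.asarray(true_ndvs))
    regressor.fit(X, y)
    return regressor
```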
- On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study which specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)