The Unreasonable Effectiveness of Easy Training Data for Hard Tasks
- URL: http://arxiv.org/abs/2401.06751v2
- Date: Wed, 5 Jun 2024 14:10:11 GMT
- Title: The Unreasonable Effectiveness of Easy Training Data for Hard Tasks
- Authors: Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe,
- Abstract summary: We present the surprising conclusion that current pretrained language models often generalize relatively well from easy to hard data.
We demonstrate this kind of easy-to-hard generalization using simple finetuning methods like in-context learning, linear heads, and QLoRA.
We conclude that easy-to-hard generalization in LMs is surprisingly strong for the tasks studied.
- Score: 84.30018805150607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How can we train models to perform well on hard test data when hard training data is by definition difficult to label correctly? This question has been termed the scalable oversight problem and has drawn increasing attention as language models have continually improved. In this paper, we present the surprising conclusion that current pretrained language models often generalize relatively well from easy to hard data, even performing as well as oracle models finetuned on hard data. We demonstrate this kind of easy-to-hard generalization using simple finetuning methods like in-context learning, linear classifier heads, and QLoRA for seven different measures of datapoint hardness, including six empirically diverse human hardness measures (like grade level) and one model-based measure (loss-based). Furthermore, we show that even if one cares most about model performance on hard data, it can be better to collect easy data rather than hard data for finetuning, since hard data is generally noisier and costlier to collect. Our experiments use open models up to 70b in size and four publicly available question-answering datasets with questions ranging in difficulty from 3rd grade science questions to college level STEM questions and general-knowledge trivia. We conclude that easy-to-hard generalization in LMs is surprisingly strong for the tasks studied. Our code is available at: https://github.com/allenai/easy-to-hard-generalization
Related papers
- Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization [126.27645170941268]
We present Easy2Hard-Bench, a collection of 6 benchmark datasets spanning various domains.
Each problem within these datasets is annotated with numerical difficulty scores.
We provide a comprehensive analysis of their performance and generalization capabilities across varying levels of difficulty.
arXiv Detail & Related papers (2024-09-27T03:49:56Z) - Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones? [65.43882564649721]
Large language models (LLMs) have demonstrated impressive capabilities, but still suffer from inconsistency issues.
We develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty.
We analyze the potential for improvement in consistency by relative consistency score.
arXiv Detail & Related papers (2024-06-18T17:25:47Z) - DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving [15.815363023014248]
We propose Difficulty-Aware Rejection Tuning (DART), a method that allocates difficult queries more trials during the synthesis phase.
DART allocates difficult queries more trials during the synthesis phase, enabling more extensive training on difficult samples.
We fine-tune various base models on our datasets ranging from 7B to 70B in size, resulting in a series of strong models called DART-MATH.
arXiv Detail & Related papers (2024-06-18T07:14:02Z) - Data Factors for Better Compositional Generalization [60.698130703909804]
We conduct an empirical analysis by training Transformer models on a variety of training sets with different data factors.
We show that increased dataset complexity can lead to better generalization behavior on multiple different generalization challenges.
We explore how training examples of different difficulty levels influence generalization differently.
arXiv Detail & Related papers (2023-11-08T01:27:34Z) - Split-PU: Hardness-aware Training Strategy for Positive-Unlabeled
Learning [42.26185670834855]
Positive-Unlabeled (PU) learning aims to learn a model with rare positive samples and abundant unlabeled samples.
This paper focuses on improving the commonly-used nnPU with a novel training pipeline.
arXiv Detail & Related papers (2022-11-30T05:48:31Z) - Difficulty-Net: Learning to Predict Difficulty for Long-Tailed
Recognition [5.977483447975081]
We propose Difficulty-Net, which learns to predict the difficulty of classes using the model's performance in a meta-learning framework.
We introduce two key concepts, namely the relative difficulty and the driver loss.
Experiments on popular long-tailed datasets demonstrated the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-09-07T07:04:08Z) - CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep
Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are popularly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
arXiv Detail & Related papers (2022-02-11T13:49:51Z) - Information-Theoretic Measures of Dataset Difficulty [54.538766940287864]
Estimating difficulty of a dataset typically involves comparing state-of-the-art models to humans.
We propose an information-theoretic perspective, framing dataset difficulty as the absence of usable information.
arXiv Detail & Related papers (2021-10-16T00:21:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.