Pretraining on the Test Set Is All You Need
- URL: http://arxiv.org/abs/2309.08632v1
- Date: Wed, 13 Sep 2023 19:47:33 GMT
- Title: Pretraining on the Test Set Is All You Need
- Authors: Rylan Schaeffer
- Abstract summary: We pretrain a 1 million parameter transformer-based LLM textbfphi-CTNL that achieves perfect results across diverse academic benchmarks.
textbfphi-CTNL also beats power-law scaling and exhibits a never-before-seen grokking-like ability to accurately predict downstream evaluation benchmarks' canaries.
- Score: 6.322449198012633
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Inspired by recent work demonstrating the promise of smaller
Transformer-based language models pretrained on carefully curated data, we
supercharge such approaches by investing heavily in curating a novel, high
quality, non-synthetic data mixture based solely on evaluation benchmarks.
Using our novel dataset mixture consisting of less than 100 thousand tokens, we
pretrain a 1 million parameter transformer-based LLM \textbf{phi-CTNL}
(pronounced ``fictional") that achieves perfect results across diverse academic
benchmarks, strictly outperforming all known foundation models.
\textbf{phi-CTNL} also beats power-law scaling and exhibits a never-before-seen
grokking-like ability to accurately predict downstream evaluation benchmarks'
canaries.
Related papers
- Improving Pretraining Data Using Perplexity Correlations [56.41097718862742]
We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations.
In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark.
arXiv Detail & Related papers (2024-09-09T17:23:29Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z) - Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z) - UZH_CLyp at SemEval-2023 Task 9: Head-First Fine-Tuning and ChatGPT Data
Generation for Cross-Lingual Learning in Tweet Intimacy Prediction [3.1798318618973362]
This paper describes the submission of UZH_CLyp for the SemEval 2023 Task 9 "Multilingual Tweet Intimacy Analysis"
We achieved second-best results in all 10 languages according to the official Pearson's correlation regression evaluation measure.
arXiv Detail & Related papers (2023-03-02T12:18:53Z) - Confidence-Guided Data Augmentation for Deep Semi-Supervised Training [0.9968241071319184]
We propose a new data augmentation technique for semi-supervised learning settings that emphasizes learning from the most challenging regions of the feature space.
We perform experiments on two benchmark RGB datasets: CIFAR-100 and STL-10, and show that the proposed scheme improves classification performance in terms of accuracy and robustness.
arXiv Detail & Related papers (2022-09-16T21:23:19Z) - ZeroGen$^+$: Self-Guided High-Quality Data Generation in Efficient
Zero-Shot Learning [97.2907428983142]
ZeroGen attempts to purely use PLM to generate data and train a tiny model without relying on task-specific annotation.
We propose a noise-robust bi-level re-weighting framework which is able to learn the per-sample weights measuring the data quality without requiring any gold data.
arXiv Detail & Related papers (2022-05-25T11:38:48Z) - Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks [75.69896269357005]
Mixup is the latest data augmentation technique that linearly interpolates input examples and the corresponding labels.
In this paper, we explore how to apply mixup to natural language processing tasks.
We incorporate mixup to transformer-based pre-trained architecture, named "mixup-transformer", for a wide range of NLP tasks.
arXiv Detail & Related papers (2020-10-05T23:37:30Z) - BLEURT: Learning Robust Metrics for Text Generation [17.40369189981227]
We propose BLEURT, a learned evaluation metric based on BERT.
A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize.
BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG Competition dataset.
arXiv Detail & Related papers (2020-04-09T17:26:52Z) - Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches, is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.