ChapGTP, ILLC's Attempt at Raising a BabyLM: Improving Data Efficiency
by Automatic Task Formation
- URL: http://arxiv.org/abs/2310.11282v1
- Date: Tue, 17 Oct 2023 14:06:06 GMT
- Title: ChapGTP, ILLC's Attempt at Raising a BabyLM: Improving Data Efficiency
by Automatic Task Formation
- Authors: Jaap Jumelet, Michael Hanna, Marianne de Heer Kloots, Anna Langedijk,
Charlotte Pouw, Oskar van der Wal
- Abstract summary: We present the submission of the ILLC at the University of Amsterdam to the BabyLM challenge (Warstadt et al., 2023).
Our final model, ChapGTP, is a masked language model that was trained for 200 epochs, aided by a novel data augmentation technique called Automatic Task Formation.
We discuss in detail the performance of this model on the three evaluation suites: BLiMP, (Super)GLUE, and MSGS.
- Score: 5.472046616411226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the submission of the ILLC at the University of Amsterdam to the
BabyLM challenge (Warstadt et al., 2023), in the strict-small track. Our final
model, ChapGTP, is a masked language model that was trained for 200 epochs,
aided by a novel data augmentation technique called Automatic Task Formation.
We discuss in detail the performance of this model on the three evaluation
suites: BLiMP, (Super)GLUE, and MSGS. Furthermore, we present a wide range of
methods that were ultimately not included in the model, but may serve as
inspiration for training LMs in low-resource settings.
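As a point of reference for the setup described above, the sketch below shows a generic masked-language-model training run on a small corpus with HuggingFace Transformers. The architecture, tokenizer, corpus file, and hyperparameters are placeholders, and the Automatic Task Formation augmentation itself is not reproduced here.

```python
# Minimal sketch of masked-language-model pre-training on a small corpus.
# Assumptions: HuggingFace Transformers/Datasets, a RoBERTa-style model trained
# from scratch, and a placeholder "babylm_train.txt"; NOT the authors' exact setup.
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          RobertaConfig, RobertaForMaskedLM, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # assumption: reuse an off-the-shelf tokenizer
model = RobertaForMaskedLM(RobertaConfig())                # randomly initialised, trained from scratch

# One text segment per line; the strict-small track provides roughly 10M words.
dataset = load_dataset("text", data_files={"train": "babylm_train.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# Dynamic masking: 15% of tokens are replaced by <mask> on the fly.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="mlm-sketch", num_train_epochs=200,
                         per_device_train_batch_size=32, learning_rate=1e-4)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```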
Related papers
- A surprisal oracle for when every layer counts [2.5716627278119444]
Active Curriculum Language Modeling (ACLM) is a learner-directed approach to training a language model.
We propose an updated ACLM process for the BabyLM 2024 task.
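The summary does not spell out how the surprisal oracle orders the data; a common realisation, sketched below, scores each sentence by its mean token surprisal under a small causal LM and presents low-surprisal sentences first. The scoring model and the easy-to-hard ordering are assumptions, not the authors' exact ACLM recipe.

```python
# Minimal sketch of surprisal-based curriculum ordering (not the exact ACLM procedure).
# Assumption: sentences are scored with an off-the-shelf GPT-2 and sorted easy-to-hard.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def mean_surprisal(sentence: str) -> float:
    """Average negative log-probability (in nats) per token under the scoring LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean next-token cross-entropy.
        loss = model(ids, labels=ids).loss
    return loss.item()

corpus = ["The cat sat on the mat.",
          "Colorless green ideas sleep furiously.",
          "The committee adjourned sine die amid procedural wrangling."]

# Easy-to-hard curriculum: present low-surprisal sentences first.
curriculum = sorted(corpus, key=mean_surprisal)
print(curriculum)
```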
arXiv Detail & Related papers (2024-12-04T07:53:45Z)
- Lil-Bevo: Explorations of Strategies for Training Language Models in More Humanlike Ways [14.480574407610424]
We present Lil-Bevo, our submission to the BabyLM Challenge.
Our baseline models performed above chance, but far below the performance levels of larger LLMs trained on more data.
Our targeted Masked Language Modeling augmentation did not seem to improve model performance in general, but did seem to help on some of the specific BLiMP tasks that we were targeting.
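Targeted masking of this kind can be sketched as masking tokens from a phenomenon-related word list preferentially rather than uniformly at random; the word list and masking rates below are illustrative assumptions, not Lil-Bevo's actual procedure.

```python
# Illustrative sketch of targeted masking: tokens from a phenomenon-related word
# list (here, a few anaphor/NPI items) are masked preferentially instead of
# masking uniformly at random. The word list and rates are assumptions.
import random

random.seed(0)
TARGET_WORDS = {"himself", "herself", "themselves", "any", "ever"}

def targeted_mask(sentence: str, target_rate: float = 0.8, base_rate: float = 0.1) -> str:
    masked = []
    for token in sentence.split():
        rate = target_rate if token.lower() in TARGET_WORDS else base_rate
        masked.append("<mask>" if random.random() < rate else token)
    return " ".join(masked)

print(targeted_mask("The senator convinced himself that nobody would ever notice ."))
```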
arXiv Detail & Related papers (2023-10-26T17:13:07Z)
- BLESS: Benchmarking Large Language Models on Sentence Simplification [55.461555829492866]
We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS).
We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting.
Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
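Few-shot evaluation of this kind typically amounts to prepending a handful of complex-to-simple demonstrations to each test sentence. The sketch below builds such a prompt; the demonstrations, instruction wording, and stand-in model are assumptions rather than the BLESS protocol.

```python
# Minimal sketch of few-shot prompting for sentence simplification.
# The demonstrations, instruction wording, and model are placeholders; BLESS's
# exact prompts and decoding settings are not reproduced here.
from transformers import pipeline

FEW_SHOT = [
    ("The physician administered the medication intravenously.",
     "The doctor gave the medicine through a vein."),
    ("The legislation was enacted subsequent to prolonged deliberation.",
     "The law was passed after a long discussion."),
]

def build_prompt(sentence: str) -> str:
    demos = "\n".join(f"Complex: {c}\nSimple: {s}" for c, s in FEW_SHOT)
    return f"Rewrite each sentence in simpler language.\n{demos}\nComplex: {sentence}\nSimple:"

generator = pipeline("text-generation", model="gpt2")  # stand-in for a large LLM
output = generator(build_prompt("The committee postponed the deliberations indefinitely."),
                   max_new_tokens=30, do_sample=False)
print(output[0]["generated_text"])
```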
arXiv Detail & Related papers (2023-10-24T12:18:17Z)
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
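The second technique can be illustrated with a small sketch: per-domain sampling weights are recomputed from how far each domain's current loss lags behind a reference loss, so lagging domains are sampled more often. The softmax-style update and temperature below are simplified stand-ins for the paper's procedure.

```python
# Simplified sketch of dynamic batch loading: domains whose loss lags furthest
# behind a reference loss get sampled more often in the next batch.
# The softmax-style update is an illustrative stand-in, not the paper's exact rule.
import numpy as np

rng = np.random.default_rng(0)
domains = ["web", "books", "code", "wiki"]
reference_loss = np.array([2.1, 2.3, 1.8, 2.0])   # target losses per domain (assumed given)

def update_sampling_weights(current_loss: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Turn per-domain loss gaps into a sampling distribution over domains."""
    gap = np.maximum(current_loss - reference_loss, 0.0)   # only domains still behind matter
    logits = gap / temperature
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

current_loss = np.array([2.6, 2.4, 1.9, 2.5])      # measured during training
weights = update_sampling_weights(current_loss)

# Compose the next batch of 16 examples according to the updated weights.
next_batch_domains = rng.choice(domains, size=16, p=weights)
print(dict(zip(domains, weights.round(3))), next_batch_domains[:5])
```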
arXiv Detail & Related papers (2023-10-10T15:13:30Z)
- Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty [0.0]
We trained an ensemble consisting of a GPT-2 and small LLaMA models on a developmentally-plausible, 10M-word BabyLM dataset.
We distilled it into a small, 58M-parameter LLaMA model, which outperforms both of its teachers as well as a comparable model trained without distillation.
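A minimal sketch of distilling a student from an ensemble of teachers: teacher logits are averaged, softened with a temperature, and matched by the student through a KL term mixed with the ordinary cross-entropy loss. The toy shapes, temperature, and mixing weight are placeholders, not the submission's actual configuration.

```python
# Minimal sketch of knowledge distillation from an ensemble of teachers.
# Temperature T, mixing weight alpha, and the toy tensor shapes are placeholders;
# the actual Baby Llama configuration is not reproduced here.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, labels, T=2.0, alpha=0.5):
    # Average the teachers' softened probability distributions (the ensemble signal).
    teacher_probs = torch.stack(
        [F.softmax(t / T, dim=-1) for t in teacher_logits_list]).mean(dim=0)
    # KL between the softened teacher ensemble and the softened student predictions.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  teacher_probs, reduction="batchmean") * (T * T)
    # Ordinary cross-entropy against the ground-truth next tokens.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
    return alpha * kd + (1.0 - alpha) * ce

# Toy shapes: batch of 2 sequences, 5 positions, vocabulary of 100 tokens.
student_logits = torch.randn(2, 5, 100, requires_grad=True)
teacher_logits = [torch.randn(2, 5, 100) for _ in range(2)]   # two teachers, conceptually
labels = torch.randint(0, 100, (2, 5))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```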
arXiv Detail & Related papers (2023-08-03T20:20:01Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
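As a generic illustration of structured pruning (not LLM-Pruner's coupled-structure method), the sketch below scores each attention head of a small Transformer by the norm of its value-projection slice and removes the lowest-scoring ones; the model, the scoring heuristic, and the 10% budget are assumptions.

```python
# Generic structured-pruning sketch: rank attention heads by a simple weight-norm
# heuristic and remove the weakest ones. This is NOT LLM-Pruner's method; the
# model, importance score, and 10% budget are assumptions.
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
cfg = model.config
head_dim = cfg.hidden_size // cfg.num_attention_heads

# Importance of head h in layer l: L2 norm of its slice of the value projection.
scores = []
for l, layer in enumerate(model.encoder.layer):
    value_w = layer.attention.self.value.weight          # shape (hidden, hidden)
    for h in range(cfg.num_attention_heads):
        head_slice = value_w[h * head_dim:(h + 1) * head_dim, :]
        scores.append((head_slice.norm().item(), l, h))

# Prune the 10% lowest-scoring heads.
scores.sort()
budget = len(scores) // 10
heads_to_prune = {}
for _, l, h in scores[:budget]:
    heads_to_prune.setdefault(l, []).append(h)

model.prune_heads(heads_to_prune)                        # physically removes the head weights
print({l: sorted(hs) for l, hs in heads_to_prune.items()})
```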
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
- ZhichunRoad at Amazon KDD Cup 2022: MultiTask Pre-Training for E-Commerce Product Search [4.220439000486713]
We propose a robust multilingual model to improve the quality of search results.
In the pre-training stage, we adopt an MLM task, a classification task, and a contrastive learning task.
In the fine-tuning stage, we use confident learning, an exponential moving average method (EMA), adversarial training (FGM), and a regularized dropout strategy (R-Drop).
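Of these, R-Drop is the easiest to show in isolation: each input is passed through the network twice with dropout active, and a symmetric KL term pulls the two predictive distributions together on top of the usual cross-entropy. The toy classifier and the alpha coefficient below are placeholders.

```python
# Minimal sketch of R-Drop regularisation on a toy classifier: two stochastic
# forward passes per example, symmetric KL between them added to cross-entropy.
# The architecture and alpha coefficient are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.3), nn.Linear(64, 4))
x = torch.randn(8, 16)
y = torch.randint(0, 4, (8,))

model.train()                       # dropout must be active so the two passes differ
logits1, logits2 = model(x), model(x)

ce = F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y)
kl = 0.5 * (
    F.kl_div(F.log_softmax(logits1, dim=-1), F.softmax(logits2, dim=-1), reduction="batchmean")
    + F.kl_div(F.log_softmax(logits2, dim=-1), F.softmax(logits1, dim=-1), reduction="batchmean")
)
alpha = 1.0
loss = ce + alpha * kl
loss.backward()
```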
arXiv Detail & Related papers (2023-01-31T07:31:34Z)
- METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
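In spirit, the recipe has a small auxiliary model fill in masked positions and the main model learn to undo those substitutions. The sketch below shows one corruption-and-denoise step; the model choices and 15% mask rate are assumptions, and the full METRO objective is considerably more involved.

```python
# Simplified sketch of denoising training with model-generated corruptions:
# an auxiliary MLM fills masked slots, and the main model learns to recover the
# original tokens. Model choices and the mask rate are assumptions; the full
# METRO recipe is not reproduced here.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilroberta-base")
aux = AutoModelForMaskedLM.from_pretrained("distilroberta-base").eval()  # auxiliary generator
main = AutoModelForMaskedLM.from_pretrained("roberta-base")              # model being pretrained

batch = tok(["The quick brown fox jumps over the lazy dog."], return_tensors="pt")
labels = batch["input_ids"].clone()

# Mask 15% of non-special tokens (force at least one position in this toy example).
special = torch.tensor(
    tok.get_special_tokens_mask(labels[0].tolist(), already_has_special_tokens=True)).bool()
mask = (torch.rand(labels.shape) < 0.15) & ~special.unsqueeze(0)
if not mask.any():
    mask[0, 1] = True
corrupted = labels.clone()
corrupted[mask] = tok.mask_token_id

# The auxiliary model proposes plausible replacements (the "model-generated signal").
with torch.no_grad():
    proposals = aux(input_ids=corrupted, attention_mask=batch["attention_mask"]).logits.argmax(-1)
corrupted[mask] = proposals[mask]

# The main model is trained to denoise: predict the originals at corrupted positions only.
labels[~mask] = -100
loss = main(input_ids=corrupted, attention_mask=batch["attention_mask"], labels=labels).loss
loss.backward()
```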
arXiv Detail & Related papers (2022-04-13T21:39:15Z)
- The USYD-JD Speech Translation System for IWSLT 2021 [85.64797317290349]
This paper describes the University of Sydney & JD's joint submission to the IWSLT 2021 low-resource speech translation task.
We trained our models with the officially provided ASR and MT datasets.
To achieve better translation performance, we explored the most recent effective strategies, including back translation, knowledge distillation, multi-feature reranking and transductive finetuning.
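Back-translation, for instance, turns monolingual target-side text into synthetic parallel data by running it through a reverse-direction model. The sketch below does this with off-the-shelf MarianMT checkpoints for a generic English-German pair; the language pair and models are assumptions, not the task's actual setup.

```python
# Minimal sketch of back-translation: monolingual English target sentences are
# translated into German with a reverse-direction model, yielding synthetic
# German->English training pairs. The en/de pair and checkpoints are assumptions.
from transformers import MarianMTModel, MarianTokenizer

reverse_name = "Helsinki-NLP/opus-mt-en-de"          # target -> source direction
tok = MarianTokenizer.from_pretrained(reverse_name)
reverse_model = MarianMTModel.from_pretrained(reverse_name)

monolingual_targets = [
    "The committee will meet again next week.",
    "Heavy rain is expected along the coast tomorrow.",
]

inputs = tok(monolingual_targets, return_tensors="pt", padding=True)
generated = reverse_model.generate(**inputs, max_new_tokens=64, num_beams=4)
synthetic_sources = tok.batch_decode(generated, skip_special_tokens=True)

# Synthetic parallel corpus: (machine-translated source, original human target).
for src, tgt in zip(synthetic_sources, monolingual_targets):
    print(f"{src}\t{tgt}")
```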
arXiv Detail & Related papers (2021-07-24T09:53:34Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: pre-trained MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
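One way to probe this claim is to destroy word order while preserving co-occurrence statistics, for example by shuffling the tokens of each training sentence before pre-training. The sketch below applies such a perturbation; whitespace tokenisation and whole-sentence shuffling are simplifying assumptions relative to the paper's controlled manipulations.

```python
# Small sketch of a word-order-destroying corpus perturbation: tokens are shuffled
# within each sentence, preserving unigram co-occurrence statistics but not syntax.
# Whitespace tokenisation and whole-sentence shuffling are simplifying assumptions.
import random

random.seed(0)

def shuffle_sentence(sentence: str) -> str:
    tokens = sentence.split()
    random.shuffle(tokens)
    return " ".join(tokens)

corpus = [
    "the children were reading stories in the garden",
    "a surprisingly small model can learn these patterns",
]
shuffled_corpus = [shuffle_sentence(s) for s in corpus]
for original, shuffled in zip(corpus, shuffled_corpus):
    print(f"{original}  ->  {shuffled}")
```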
arXiv Detail & Related papers (2021-04-14T06:30:36Z)