Lil-Bevo: Explorations of Strategies for Training Language Models in More Humanlike Ways
- URL: http://arxiv.org/abs/2310.17591v1
- Date: Thu, 26 Oct 2023 17:13:07 GMT
- Title: Lil-Bevo: Explorations of Strategies for Training Language Models in More Humanlike Ways
- Authors: Venkata S Govindarajan, Juan Diego Rodriguez, Kaj Bostrom, Kyle Mahowald
- Abstract summary: We present Lil-Bevo, our submission to the BabyLM Challenge.
Our baseline models performed above chance, but far below the performance levels of larger LLMs trained on more data.
Our targeted Masked Language Modeling augmentation did not seem to improve model performance in general, but did seem to help on some of the specific BLiMP tasks that we were targeting.
- Score: 14.480574407610424
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Lil-Bevo, our submission to the BabyLM Challenge. We pretrained
our masked language models with three ingredients: an initial pretraining with
music data, training on shorter sequences before training on longer ones, and
masking specific tokens to target some of the BLiMP subtasks. Overall, our
baseline models performed above chance, but far below the performance levels of
larger LLMs trained on more data. We found that training on short sequences
performed better than training on longer sequences. Pretraining on music may
help performance marginally, but, if so, the effect seems small. Our targeted
Masked Language Modeling augmentation did not seem to improve model performance
in general, but did seem to help on some of the specific BLiMP tasks that we
were targeting (e.g., Negative Polarity Items). Training performant LLMs on
small amounts of data is a difficult but potentially informative task. While
some of our techniques showed some promise, more work is needed to explore
whether they can improve performance more than the modest gains here. Our code
is available at https://github.com/venkatasg/Lil-Bevo and our models at
https://huggingface.co/collections/venkatasg/babylm-653591cdb66f4bf68922873a
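
The authors' implementation is in the linked repository; as a rough illustration of the targeted Masked Language Modeling idea described above, the sketch below boosts the masking probability for a handful of tokens associated with a targeted BLiMP phenomenon (here, Negative Polarity Items). The token list, probabilities, and function name are illustrative assumptions, not values or code from the paper.

```python
import random

# Illustrative sketch of targeted MLM masking (not the authors' code; see the
# linked repository for the real implementation). Tokens tied to a targeted
# BLiMP phenomenon, e.g. Negative Polarity Items, are masked more often than
# the usual 15% so the model gets extra training signal on them.
TARGET_TOKENS = {"any", "ever", "yet"}   # hypothetical NPI-related tokens
BASE_MASK_PROB = 0.15                    # standard BERT-style masking rate
TARGET_MASK_PROB = 0.50                  # boosted rate for targeted tokens (assumed)
MASK_TOKEN = "[MASK]"

def targeted_mask(tokens, rng=random):
    """Return (masked_tokens, labels); labels are None where no prediction is made."""
    masked, labels = [], []
    for tok in tokens:
        p = TARGET_MASK_PROB if tok in TARGET_TOKENS else BASE_MASK_PROB
        if rng.random() < p:
            masked.append(MASK_TOKEN)
            labels.append(tok)           # model must reconstruct the original token
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

# Example: the NPI "ever" is masked far more often than ordinary tokens.
print(targeted_mask("nobody has ever seen anything like it".split()))
```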
Related papers
- Cross-model Control: Improving Multiple Large Language Models in One-time Training [34.98931804630706]
Cross-model Control (CMC) is a method that improves multiple large language models with a single, one-time training run.
To achieve this, we incorporate a tiny language model with a minimal number of parameters.
We propose a novel token mapping strategy named PM-MinED to make this tiny language model applicable to models with different vocabularies.
arXiv Detail & Related papers (2024-10-23T06:52:09Z)
- Patch-Level Training for Large Language Models [69.67438563485887]
This paper introduces patch-level training for Large Language Models (LLMs).
During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch.
Following this, the model continues token-level training on the remaining training data to align with the inference mode.
arXiv Detail & Related papers (2024-07-17T15:48:39Z)
- Instruction Pre-Training: Language Models are Supervised Multitask Learners [115.95022434390181]
In this paper, we propose a framework that augments massive raw corpora with instruction-response pairs to pre-train language models (LMs).
In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training.
arXiv Detail & Related papers (2024-06-20T16:55:33Z)
- Small Language Models Improve Giants by Rewriting Their Outputs [18.025736098795296]
We tackle the problem of leveraging training data to improve the performance of large language models (LLMs) without fine-tuning.
We create a pool of candidates from the LLM through few-shot prompting, and employ a compact model, the LM-corrector (LMCor), trained specifically to merge these candidates into an enhanced output.
Experiments on four natural language generation tasks demonstrate that even a small LMCor model (250M) substantially improves the few-shot performance of LLMs (62B), matching and even outperforming standard fine-tuning.
arXiv Detail & Related papers (2023-05-22T22:07:50Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs under two constraints: being task-agnostic and minimizing reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models.
We present three findings across four NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- BudgetLongformer: Can we Cheaply Pretrain a SotA Legal Language Model From Scratch? [0.0]
We train Longformer models with the efficient replaced token detection (RTD) task on legal data to show that pretraining efficient LMs is possible with much less compute.
We find that both the small and base models outperform their baselines on the in-domain BillSum task as well as on out-of-domain tasks.
arXiv Detail & Related papers (2022-11-30T16:09:20Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework [10.656788279434798]
We propose a simple and efficient learning framework, TLM, that does not rely on large-scale pretraining.
On eight classification datasets in four domains, TLM achieves results better than or similar to pretrained language models.
arXiv Detail & Related papers (2021-11-07T17:13:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.