Pre-training LLMs using human-like development data corpus
- URL: http://arxiv.org/abs/2311.04666v4
- Date: Wed, 10 Jan 2024 05:36:05 GMT
- Title: Pre-training LLMs using human-like development data corpus
- Authors: Khushi Bhardwaj, Raj Sanjay Shah, Sashank Varma
- Abstract summary: We pre-train and evaluate Large Language Models (LLMs) on their ability to learn contextual word representations using roughly the same number of tokens as seen by children.
We provide a strong set of baselines with different architectures, evaluate changes in performance across epochs, and report pre-training metrics for the strict-small and strict tracks of the task.
- Score: 3.5757761767474876
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Pre-trained Large Language Models (LLMs) have shown success in a diverse set
of language inference and understanding tasks. The pre-training stage of LLMs
looks at a large corpus of raw textual data. The BabyLM shared task compares
LLM pre-training to human language acquisition, where the number of tokens seen
by 13-year-old children is orders of magnitude smaller than the number seen by
LLMs. In this work, we pre-train and evaluate LLMs on their ability to learn
contextual word representations using roughly the same number of tokens as seen
by children. We provide a strong set of baselines with different architectures,
evaluate changes in performance across epochs, and report pre-training metrics
for the strict-small and strict tracks of the task. We also loosely replicate
the RoBERTa baseline provided by the task organizers to assess the robustness
of training to hyperparameter selection and its replicability. We provide the
submission details for the strict and strict-small tracks in this report.
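As an illustration of the setup described in the abstract, here is a minimal sketch of small-scale masked-language-model pre-training on a child-scale token budget. It assumes the Hugging Face transformers and datasets libraries; the file name babylm_train.txt is a placeholder for a local copy of the BabyLM training corpus, and the model size and hyperparameters are illustrative rather than the paper's reported settings.

```python
# Minimal sketch: pre-train a small RoBERTa-style masked LM on a ~10M-token corpus.
# Assumes: pip install transformers datasets; "babylm_train.txt" is a placeholder path.
from datasets import load_dataset
from transformers import (
    RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")  # reuse an existing vocab for simplicity

# Load the raw corpus and tokenize it into short sequences.
raw = load_dataset("text", data_files={"train": "babylm_train.txt"})["train"]
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)
tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# A deliberately small model, in the spirit of matching a child-scale token budget.
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size, hidden_size=256,
    num_hidden_layers=8, num_attention_heads=8, intermediate_size=1024,
)
model = RobertaForMaskedLM(config)

# Standard MLM objective: mask 15% of tokens and predict them.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="babylm-roberta", per_device_train_batch_size=32,
    num_train_epochs=10, learning_rate=1e-4, logging_steps=500,
)
Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```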
Related papers
- CUTE: Measuring LLMs' Understanding of Their Tokens [54.70665106141121]
Large Language Models (LLMs) show remarkable performance on a wide variety of tasks.
This raises the question: To what extent can LLMs learn orthographic information?
We propose a new benchmark, CUTE, which features a collection of tasks designed to test the orthographic knowledge of LLMs.
arXiv Detail & Related papers (2024-09-23T18:27:03Z)
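Not the CUTE benchmark itself, but a toy probe in the same spirit: ask a model spelling-level questions whose ground truth can be computed directly from the string. The model name (gpt2) and the prompt are assumptions for illustration only.

```python
# Toy orthographic probe (in the spirit of CUTE, not the benchmark itself):
# ask a model spelling-level questions and score against string-level ground truth.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # example model, not from the paper

words = ["language", "acquisition", "tokenizer"]
correct = 0
for word in words:
    prompt = f"Spell the word '{word}' letter by letter, separated by spaces:"
    output = generator(prompt, max_new_tokens=30, do_sample=False)[0]["generated_text"]
    answer = output[len(prompt):]          # keep only the continuation
    gold = " ".join(word)                  # ground truth derived from the string itself
    correct += int(gold in answer.lower())
print(f"orthographic accuracy: {correct}/{len(words)}")
```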
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
Large language models (LLMs) have sparked debate over whether they genuinely generalize to unseen tasks or rely on memorizing vast amounts of pretraining data.
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the LLM output probabilities and the pretraining data frequency.
This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
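To make the distributional-memorization idea above concrete, here is a small, hedged sketch: it correlates per-example model log-probabilities with how often related material appears in the pretraining data. Both quantities below are synthetic placeholders (the paper estimates frequencies from real pretraining corpora); only the correlation computation itself is meant to carry over.

```python
# Sketch of "distributional memorization": correlation between model output
# probabilities and pretraining-data frequency. Frequencies and log-probs below
# are synthetic stand-ins; the paper works with real pretraining corpora.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Placeholder: how often material related to each test example appears in pretraining data.
pretraining_freq = rng.integers(1, 10_000, size=200)

# Placeholder: the model's log-probability for the gold answer of each test example.
# We fake a frequency-dependent signal plus noise, just to exercise the computation.
model_logprob = 0.5 * np.log(pretraining_freq) + rng.normal(0, 1.0, size=200)

rho, pvalue = spearmanr(pretraining_freq, model_logprob)
print(f"distributional memorization (Spearman rho): {rho:.3f} (p={pvalue:.2g})")
# A high positive rho suggests performance tracks data frequency (memorization-like);
# a rho near zero is more consistent with generalization.
```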
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
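The self-synthesis step summarized above can be sketched loosely as follows, assuming a student model served through the transformers text-generation pipeline; the model name (gpt2), prompt format, and filtering rule are placeholders rather than SELF-GUIDE's actual recipe, and the resulting pairs would subsequently be used to finetune the same student model.

```python
# Loose sketch of self-synthetic data generation (in the spirit of SELF-GUIDE):
# prompt the student model to produce new task-specific input-output pairs.
from transformers import pipeline

student = pipeline("text-generation", model="gpt2")  # placeholder student model

task_instruction = "Classify the sentiment of the sentence as positive or negative."
seed_examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I regret buying this blender.", "negative"),
]

def synthesis_prompt(instruction, seeds):
    lines = [f"Task: {instruction}", "Examples:"]
    for text, label in seeds:
        lines.append(f"Input: {text}\nOutput: {label}")
    lines.append("Input:")  # ask the model to continue with a fresh example
    return "\n".join(lines)

synthetic_pairs = []
for _ in range(5):
    prompt = synthesis_prompt(task_instruction, seed_examples)
    completion = student(prompt, max_new_tokens=40, do_sample=True, temperature=0.9)[0]["generated_text"]
    generated = completion[len(prompt):].strip()
    # Keep only completions that contain both an input and an output.
    if "Output:" in generated:
        new_input, new_output = generated.split("Output:", 1)
        output_lines = new_output.strip().splitlines()
        if new_input.strip() and output_lines:
            synthetic_pairs.append((new_input.strip(), output_lines[0]))

print(f"kept {len(synthetic_pairs)} synthetic pairs for finetuning")
```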
- Taking a Deep Breath: Enhancing Language Modeling of Large Language Models with Sentinel Tokens [21.61634020256455]
Transformer-based large language models (LLMs) suffer a performance degradation when modeling long-term contexts.
We propose a simple yet effective method to enable LLMs to take a deep breath, encouraging them to summarize information contained within discrete text chunks.
arXiv Detail & Related papers (2024-06-16T15:50:10Z)
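The sentinel-token idea above can be illustrated with a minimal sketch: insert a special token after every fixed-size chunk so the model has an explicit position at which to aggregate that chunk. The token name (<sent>) and chunk size are assumptions for illustration, not the paper's exact formulation, and the model's embedding matrix would need resizing to accommodate the new token.

```python
# Sketch: insert a sentinel token after every fixed-size chunk of tokens, giving the
# model an explicit position at which to aggregate each chunk.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
SENTINEL = "<sent>"
tokenizer.add_special_tokens({"additional_special_tokens": [SENTINEL]})

def insert_sentinels(text: str, chunk_size: int = 16) -> list[int]:
    """Tokenize text and append a sentinel id after every `chunk_size` tokens."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    sentinel_id = tokenizer.convert_tokens_to_ids(SENTINEL)
    out = []
    for i in range(0, len(ids), chunk_size):
        out.extend(ids[i:i + chunk_size])
        out.append(sentinel_id)
    return out

ids = insert_sentinels("Long documents can be split into chunks, each followed by a sentinel "
                       "token that the model is trained to treat as a local summary slot.")
print(tokenizer.decode(ids))
```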
- LLM-augmented Preference Learning from Natural Language [19.700169351688768]
Large Language Models (LLMs) are equipped to deal with larger context lengths.
LLMs can consistently outperform the state of the art (SotA) when the target text is long.
Few-shot learning yields better performance than zero-shot learning.
arXiv Detail & Related papers (2023-10-12T17:17:27Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
- Baby's CoThought: Leveraging Large Language Models for Enhanced Reasoning in Compact Models [3.1244568065126863]
We propose a "CoThought" pipeline, which efficiently trains smaller "baby" language models (BabyLMs).
Our pipeline restructures a dataset of less than 100M in size using GPT-3.5-turbo, transforming it into task-oriented, human-readable texts.
Our BabyLM outperforms the vanilla RoBERTa in 10 linguistic, NLU, and question-answering tasks by more than 3 points.
arXiv Detail & Related papers (2023-08-03T10:52:52Z)
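A rough sketch of the restructuring step described above, assuming access to the OpenAI Python client and an API key in the environment; the prompt, snippet data, and output handling are placeholders, and the paper's pipeline is considerably more elaborate.

```python
# Rough sketch of restructuring raw corpus snippets into task-oriented, human-readable
# texts with an instruction-tuned model (in the spirit of the CoThought pipeline).
# Requires: pip install openai, and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

raw_snippets = [
    "the cat sat on the mat . the mat was red .",
    "where is the ball ? the ball is under the table .",
]

restructured = []
for snippet in raw_snippets:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Rewrite the snippet as a short, well-formed "
             "reading-comprehension exercise with a question and an answer."},
            {"role": "user", "content": snippet},
        ],
    )
    restructured.append(response.choices[0].message.content)

# The restructured texts would then form the pre-training corpus for the compact BabyLM.
print(restructured[0])
```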
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
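This is not LLM-Pruner's algorithm, but a toy illustration of what pruning coupled structures means: in an MLP block, intermediate channel j ties the j-th output row of the up-projection to the j-th input column of the down-projection, so the least important channels (scored here by simple weight magnitude, an assumption for illustration) are removed from both weights together.

```python
# Toy illustration of structural pruning of *coupled* structures (LLM-Pruner spirit,
# not its actual algorithm): intermediate channel j couples the j-th output row of
# fc1 with the j-th input column of fc2, so both are removed together.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden, intermediate = 64, 256
fc1 = nn.Linear(hidden, intermediate)   # up-projection
fc2 = nn.Linear(intermediate, hidden)   # down-projection

# Importance of each intermediate channel: L2 norm of its coupled weights.
importance = fc1.weight.norm(dim=1) + fc2.weight.norm(dim=0)

# Keep the top 75% of channels, dropping the least important 25%.
keep = torch.topk(importance, k=int(0.75 * intermediate)).indices.sort().values

pruned_fc1 = nn.Linear(hidden, len(keep))
pruned_fc2 = nn.Linear(len(keep), hidden)
with torch.no_grad():
    pruned_fc1.weight.copy_(fc1.weight[keep])
    pruned_fc1.bias.copy_(fc1.bias[keep])
    pruned_fc2.weight.copy_(fc2.weight[:, keep])
    pruned_fc2.bias.copy_(fc2.bias)

x = torch.randn(1, hidden)
print("original params:", sum(p.numel() for p in (*fc1.parameters(), *fc2.parameters())))
print("pruned params:  ", sum(p.numel() for p in (*pruned_fc1.parameters(), *pruned_fc2.parameters())))
print("output shape unchanged:", pruned_fc2(torch.relu(pruned_fc1(x))).shape)
```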
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
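A tiny sketch of the kind of manipulation this line of work relies on: shuffling word order within each sentence destroys syntax while leaving unordered word co-occurrence statistics intact, which lets one test how much of pre-training's downstream success distributional information alone explains. The sentence-splitting rule here is deliberately crude and only illustrative.

```python
# Sketch: shuffle word order within each sentence, destroying syntax while keeping
# unordered word co-occurrence statistics intact.
import random

random.seed(0)

def shuffle_word_order(text: str) -> str:
    shuffled_sentences = []
    for sentence in text.split("."):      # crude sentence split, for illustration only
        words = sentence.split()
        random.shuffle(words)
        shuffled_sentences.append(" ".join(words))
    return ". ".join(s for s in shuffled_sentences if s)

original = "The child heard the story. The story was about a dragon."
print(shuffle_word_order(original))
# prints the same words, scrambled within each sentence
```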
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.