Related papers: TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

URL: http://arxiv.org/abs/2305.07759v2
Date: Wed, 24 May 2023 23:30:43 GMT
Title: TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Authors: Ronen Eldan and Yuanzhi Li
Abstract summary: Language models (LMs) often struggle to produce coherent and fluent text when they are small. We introduce TinyStories, a dataset of short stories that only contain words that a typical 3 to 4-year-old usually understand. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models.
Score: 37.65216279977461
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Language models (LMs) are powerful tools for natural language processing, but they often struggle to produce coherent and fluent text when they are small. Models with around 125M parameters such as GPT-Neo (small) or GPT-2 (small) can rarely generate coherent and consistent English text beyond a few words even after extensive training. This raises the question of whether the emergence of the ability to produce coherent English text only occurs at larger scales (with hundreds of millions of parameters or more) and complex architectures (with many layers of global attention). In this work, we introduce TinyStories, a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities. We also introduce a new paradigm for the evaluation of language models: We suggest a framework which uses GPT-4 to grade the content generated by these models as if those were stories written by students and graded by a (human) teacher. This new paradigm overcomes the flaws of standard benchmarks which often requires the model's output to be very structures, and moreover provides a multidimensional score for the model, providing scores for different capabilities such as grammar, creativity and consistency. We hope that TinyStories can facilitate the development, analysis and research of LMs, especially for low-resource or specialized domains, and shed light on the emergence of language capabilities in LMs.

Related papers

mmBERT: A Modern Multilingual Encoder with Annealed Language Learning [57.58071656545661]
mmBERT is an encoder-only language model pretrained on 3T tokens of multilingual text.<n>We add over 1700 low-resource languages to the data mix only during the decay phase.<n>We show that mmBERT significantly outperforms the previous generation of models on classification and retrieval tasks.
arXiv Detail & Related papers (2025-09-08T17:08:42Z)
NanoVLMs: How small can we go and still make coherent Vision Language Models? [3.686492659818726]
Vision-Language Models (VLMs) have garnered significant research attention for their ability to leverage Large Language Models (LLMs) in multimodal tasks. However, their potential is constrained by inherent challenges, including proprietary restrictions, substantial computational demands, and limited accessibility. This underscores a pivotal inquiry: how small can a VLM be and still produce fluent and consistent text?
arXiv Detail & Related papers (2025-02-11T02:31:45Z)
Generative Model for Less-Resourced Language with 1 billion parameters [0.0]
GaMS 1B - Generative Model for Slovene with 1 billion parameters was created by continuing pretraining of the existing English OPT model. We develop a new tokenizer adapted to Slovene, Croatian, and English languages. We evaluate our models on several classification datasets from the Slovene suite of benchmarks and generative sentence simplification task SENTA.
arXiv Detail & Related papers (2024-10-09T13:59:34Z)
Verbing Weirds Language (Models): Evaluation of English Zero-Derivation in Five LLMs [45.906366638174624]
This paper reports the first study on the behavior of large language models with reference to conversion. We design a task for testing the degree to which models can generalize over words in a construction with a non-prototypical part of speech. We find that GPT-4 performs best on the task, followed by GPT-3.5, but that the open source language models are also able to perform it.
arXiv Detail & Related papers (2024-03-26T16:45:27Z)
Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings. An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts) This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
Paramanu: A Family of Novel Efficient Generative Foundation Language Models for Indian Languages [3.9018931027384056]
We present "Paramanu", a family of novel language models (LM) for Indian languages. It covers 10 languages (Assamese, Bangla, Hindi, Konkani, Maithili, Marathi, Odia, Sanskrit, Tamil, Telugu) across 5 scripts. The models are pretrained on a single GPU with context size of 1024 and vary in size from 13.29 million (M) to 367.5 M parameters.
arXiv Detail & Related papers (2024-01-31T17:58:10Z)
mGPT: Few-Shot Learners Go Multilingual [1.4354798873010843]
This paper introduces two autoregressive GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages. We reproduce the GPT-3 architecture using GPT-2 sources and the sparse attention mechanism. The resulting models show performance on par with the recently released XGLM models by Facebook.
arXiv Detail & Related papers (2022-04-15T13:02:33Z)
Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages. Our largest model sets new state of the art in few-shot learning in more than 20 representative languages. We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z)
How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes. With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech. We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT. We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers. We show that our model can bring an average increase of 1.5% under the 12-layer setting.
arXiv Detail & Related papers (2021-04-15T02:36:49Z)
It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners [14.264737570114631]
We show that performance similar to GPT-3 can be obtained with language models that are much "greener" We identify key factors required for successful natural language understanding with small language models.
arXiv Detail & Related papers (2020-09-15T14:18:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.