Related papers: Readability $\ne$ Learnability: Rethinking the Role of Simplicity in Training Small Language Models

URL: http://arxiv.org/abs/2510.13915v1
Date: Wed, 15 Oct 2025 08:17:02 GMT
Title: Readability $\ne$ Learnability: Rethinking the Role of Simplicity in Training Small Language Models
Authors: Ivan Lee, Taylor Berg-Kirkpatrick
Abstract summary: Recent studies suggest that very small language models (SLMs) can generate surprisingly coherent text when trained on child-directed corpora such as TinyStories. These findings have been interpreted as evidence that readability plays a key role in enabling such capabilities to emerge. We construct synthetic datasets with matched structure but varied readability, and find that readability alone does not predict coherence or learning efficiency in SLMs.
Score: 33.13548175654642
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Recent studies suggest that very small language models (SLMs) can generate surprisingly coherent text when trained on simplified, child-directed corpora such as TinyStories. These findings have been interpreted as evidence that readability -- characterized by accessible vocabulary, familiar narrative structure, and simple syntax -- plays a key role in enabling such capabilities to emerge. In this paper, we challenge that interpretation. We construct synthetic datasets with matched structure but varied readability, and find that readability alone does not predict coherence or learning efficiency in SLMs. Models trained on complex, adult-level text perform comparably to those trained on simplified language, and even exhibit faster development of coherence during training. Instead, we show that statistical simplicity, as measured by n-gram diversity, is a stronger predictor of learnability. Our findings caution against the growing trend of anthropomorphizing language model training -- drawing parallels to human cognitive development without empirical basis -- and argue for more precise reasoning about what properties actually support capability emergence in small models.
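The abstract names n-gram diversity as its measure of statistical simplicity but this listing does not give the paper's exact definition. As a rough illustration only, the sketch below uses one common formulation (the fraction of n-grams in a text that are distinct, sometimes called distinct-n). The function name `ngram_diversity`, the whitespace tokenization, and the two sample sentences are assumptions made for this example, not taken from the paper.

```python
def ngram_diversity(tokens, n):
    """Return the fraction of n-grams in `tokens` that are distinct.

    Assumed formulation (distinct-n); the paper may define the metric differently.
    Returns 0.0 when the sequence is too short to contain any n-gram.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Hypothetical sample texts, tokenized by whitespace for simplicity.
repetitive = "the cat sat . the cat sat . the cat sat .".split()
varied = "a storm rolled over the quiet harbor before dawn broke slowly".split()

print(ngram_diversity(repetitive, 2))  # 4 distinct bigrams out of 11
print(ngram_diversity(varied, 2))      # every bigram is distinct, so 1.0
```

Under this formulation a lower score means more repeated n-grams, which is the sense in which a corpus could be statistically simple regardless of how readable it is to a human.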

Related papers

Schema for In-Context Learning [0.7850388075652649]
In-context learning (ICL) enables language models to adapt to new tasks by conditioning on demonstration examples. Inspired by cognitive science, we introduce SCHEMA ACTIVATED IN CONTEXT (SA-ICL). This framework extracts the representation of the building blocks of cognition for the reasoning process instilled from prior examples. We show that SA-ICL consistently boosts performance, up to 36.19 percent, when the single demonstration example is of high quality.
arXiv Detail & Related papers (2025-10-14T21:00:15Z)
Toward Understanding In-context vs. In-weight Learning [50.24035812301655]
We identify simplified distributional properties that give rise to the emergence and disappearance of in-context learning. We then extend the study to a full large language model, showing how fine-tuning on various collections of natural language prompts can elicit similar in-context and in-weight learning behaviour.
arXiv Detail & Related papers (2024-10-30T14:09:00Z)
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness [68.69369585600698]
Deep learning models often suffer from a lack of interpretability due to polysemanticity. Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability. We show that monosemantic features not only enhance interpretability but also bring concrete gains in model performance.
arXiv Detail & Related papers (2024-10-27T18:03:20Z)
Verbalized Probabilistic Graphical Modeling [8.524824578426962]
We propose Verbalized Probabilistic Graphical Modeling (vPGM) to simulate key principles of Probabilistic Graphical Models (PGMs) in natural language. vPGM bypasses expert-driven model design, making it well-suited for scenarios with limited assumptions or scarce data. Our results indicate that the model effectively enhances confidence calibration and text generation quality.
arXiv Detail & Related papers (2024-06-08T16:35:31Z)
How Well Do Text Embedding Models Understand Syntax? [50.440590035493074]
The ability of text embedding models to generalize across a wide range of syntactic contexts remains under-explored. Our findings reveal that existing text embedding models have not sufficiently addressed these syntactic understanding challenges. We propose strategies to augment the generalization ability of text embedding models in diverse syntactic scenarios.
arXiv Detail & Related papers (2023-11-14T08:51:00Z)
Evaluating Neural Language Models as Cognitive Models of Language Acquisition [4.779196219827507]
We argue that some of the most prominent benchmarks for evaluating the syntactic capacities of neural language models may not be sufficiently rigorous. When trained on small-scale data modeling child language acquisition, the LMs can be readily matched by simple baseline models. We conclude with suggestions for better connecting LMs with the empirical study of child language acquisition.
arXiv Detail & Related papers (2023-10-31T00:16:17Z)
Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension. But to achieve these results, LMs must be trained in distinctly un-human-like ways. Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning? We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z)
Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning [57.74233319453229]
Large language models (LLMs) have emerged as a groundbreaking technology and their unparalleled text generation capabilities have sparked interest in their application to the fundamental sentence representation learning task. We propose MultiCSR, a multi-level contrastive sentence representation learning framework that decomposes the process of prompting LLMs to generate a corpus. Our experiments reveal that MultiCSR enables a less advanced LLM to surpass the performance of ChatGPT, while applying it to ChatGPT achieves better state-of-the-art results.
arXiv Detail & Related papers (2023-10-17T03:21:43Z)
Evaluating Transformer's Ability to Learn Mildly Context-Sensitive Languages [6.227678387562755]
Recent studies suggest that self-attention is theoretically limited in learning even some regular and context-free languages. We test the Transformer's ability to learn mildly context-sensitive languages of varying complexities. Our analyses show that the learned self-attention patterns and representations modeled dependency relations and demonstrated counting behavior.
arXiv Detail & Related papers (2023-09-02T08:17:29Z)
SINC: Self-Supervised In-Context Learning for Vision-Language Tasks [64.44336003123102]
We propose a framework to enable in-context learning in large language models. A meta-model can learn on self-supervised prompts consisting of tailored demonstrations. Experiments show that SINC outperforms gradient-based methods in various vision-language tasks.
arXiv Detail & Related papers (2023-07-15T08:33:08Z)
Emergent Linguistic Structures in Neural Networks are Fragile [20.692540987792732]
Large Language Models (LLMs) have been reported to have strong performance on natural language processing tasks. We propose a framework to assess the consistency and robustness of linguistic representations.
arXiv Detail & Related papers (2022-10-31T15:43:57Z)
A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes. We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
Dependency Induction Through the Lens of Visual Perception [81.91502968815746]
We propose an unsupervised grammar induction model that leverages word concreteness and a structural vision-based heuristic to jointly learn constituency-structure and dependency-structure grammars. Our experiments show that the proposed extension outperforms the current state-of-the-art visually grounded models in constituency parsing even with a smaller grammar size.
arXiv Detail & Related papers (2021-09-20T18:40:37Z)
Structural Supervision Improves Few-Shot Learning and Syntactic Generalization in Neural Language Models [47.42249565529833]
Humans can learn structural properties about a word from minimal experience. We assess the ability of modern neural language models to reproduce this behavior in English.
arXiv Detail & Related papers (2020-10-12T14:12:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.