Semantic Chunking and the Entropy of Natural Language
- URL: http://arxiv.org/abs/2602.13194v2
- Date: Wed, 18 Feb 2026 18:59:22 GMT
- Title: Semantic Chunking and the Entropy of Natural Language
- Authors: Weishun Zhong, Doron Sivan, Tankut Can, Mikhail Katkov, Misha Tsodyks
- Abstract summary: The entropy rate of printed English is famously estimated to be about one bit per character. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80 percent redundancy relative to the five bits per character expected for random text. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language, providing a first-principles account of this redundancy level. Our model describes a procedure of self-similarly segmenting text into semantically coherent chunks down to the single-word level. The semantic structure of the text can then be hierarchically decomposed, allowing for analytical treatment. Numerical experiments with modern LLMs and open datasets suggest that our model quantitatively captures the structure of real texts at different levels of the semantic hierarchy. The entropy rate predicted by our model agrees with the estimated entropy rate of printed English. Moreover, our theory further reveals that the entropy rate of natural language is not fixed but should increase systematically with the semantic complexity of corpora, which is captured by the only free parameter in our model.
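The 80 percent figure follows from comparing the roughly one bit per character entropy rate against an approximately five-bit-per-character baseline for unconstrained text. A minimal sketch of that comparison, using gzip compression as a crude upper-bound entropy estimator (the sample text and the 27-symbol baseline are illustrative assumptions, not the paper's method):

```python
import gzip
import math

def entropy_rate_estimate(text: str) -> float:
    """Crude upper bound on the entropy rate (bits/char) via gzip compression."""
    raw = text.encode("ascii", errors="ignore")
    compressed = gzip.compress(raw, compresslevel=9)
    return 8 * len(compressed) / len(raw)

# A 27-symbol alphabet (26 letters + space) gives log2(27) ~ 4.75 bits,
# commonly rounded to the "five bits per character" random-text baseline.
baseline = math.log2(27)

# Highly repetitive toy sample; real corpora compress far less.
sample = "the quick brown fox jumps over the lazy dog " * 200
h = entropy_rate_estimate(sample)
redundancy = 1 - h / baseline
print(f"estimated entropy rate: {h:.2f} bits/char")
print(f"redundancy vs. {baseline:.2f}-bit baseline: {redundancy:.0%}")
```

Compression-based estimates only upper-bound the true entropy rate; Shannon's original one-bit figure came from human next-letter guessing experiments.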
Related papers
- Explicit Grammar Semantic Feature Fusion for Robust Text Classification [0.0]
Natural Language Processing enables computers to understand human language by analysing and classifying text efficiently. Existing models capture features by learning from large corpora with transformer models, which are computationally intensive and unsuitable for resource-constrained environments. Our proposed study incorporates comprehensive grammatical rules alongside semantic information to build a robust, lightweight classification model.
arXiv Detail & Related papers (2026-02-24T10:25:29Z)
- The Statistical Signature of LLMs [1.3135750017147134]
We show that a simple, model-agnostic measure of statistical regularity differentiates generative regimes directly from surface text. Across settings, compression reveals a persistent structural signature of probabilistic generation. Our findings introduce a simple and robust framework for quantifying how generative systems reshape textual production.
arXiv Detail & Related papers (2026-02-20T11:33:37Z)
- Correlation Dimension of Auto-Regressive Large Language Models [11.183390901786659]
Large language models (LLMs) have achieved remarkable progress in natural language generation. They continue to display puzzling behaviors, such as repetition and incoherence, even when exhibiting low perplexity. We introduce correlation dimension, a fractal-geometric measure of self-similarity, to quantify the complexity of text.
arXiv Detail & Related papers (2025-10-24T08:42:23Z)
- Slaves to the Law of Large Numbers: An Asymptotic Equipartition Property for Perplexity in Generative Language Models [0.0]
We show that the logarithmic perplexity of any large text generated by a language model must converge to the average entropy of its token distributions. This defines a "typical set" that all long synthetic texts generated by a language model must belong to.
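The convergence claim above can be checked numerically for a toy model. A minimal sketch using an i.i.d. three-token "language model" (the distribution and sample size are illustrative assumptions, not from the paper):

```python
import math
import random

random.seed(0)

# Toy next-token distribution, fixed at every position.
probs = [0.5, 0.3, 0.2]
entropy = -sum(p * math.log2(p) for p in probs)  # H ~ 1.4855 bits

# Generate a long synthetic text and measure its log-perplexity.
n = 100_000
tokens = random.choices(range(3), weights=probs, k=n)
log_perplexity = -sum(math.log2(probs[t]) for t in tokens) / n

print(f"entropy H = {entropy:.4f} bits")
print(f"log-perplexity of the sample = {log_perplexity:.4f} bits")
# By the asymptotic equipartition property, log-perplexity
# concentrates around H as the sample length grows.
```

For a real autoregressive model the per-position distributions differ, and the convergence target is their average entropy rather than a single fixed H.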
arXiv Detail & Related papers (2024-05-22T16:23:40Z)
- Robustness of the Random Language Model [0.0]
The model suggests a simple picture of first language learning as a type of annealing in the vast space of potential languages.
It implies a single continuous transition to grammatical syntax, at which the symmetry among potential words and categories is spontaneously broken.
Results are discussed in light of theory of first-language acquisition in linguistics, and recent successes in machine learning.
arXiv Detail & Related papers (2023-09-26T13:14:35Z)
- Model Criticism for Long-Form Text Generation [113.13900836015122]
We apply a statistical tool, model criticism in latent space, to evaluate the high-level structure of generated text.
We perform experiments on three representative aspects of high-level discourse -- coherence, coreference, and topicality.
We find that transformer-based language models are able to capture topical structures but have a harder time maintaining structural coherence or modeling coreference.
arXiv Detail & Related papers (2022-10-16T04:35:58Z)
- A Unified Understanding of Deep NLP Models for Text Classification [88.35418976241057]
We have developed a visual analysis tool, DeepNLPVis, to enable a unified understanding of NLP models for text classification.
The key idea is a mutual information-based measure, which provides quantitative explanations on how each layer of a model maintains the information of input words in a sample.
A multi-level visualization, which consists of a corpus-level, a sample-level, and a word-level visualization, supports the analysis from the overall training set to individual samples.
arXiv Detail & Related papers (2022-06-19T08:55:07Z)
- Locally Typical Sampling [84.62530743899025]
We show that today's probabilistic language generators fall short when it comes to producing coherent and fluent text. We propose a simple and efficient procedure for enforcing this criterion when generating from probabilistic models.
arXiv Detail & Related papers (2022-02-01T18:58:45Z)
- How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
- Unnatural Language Inference [48.45003475966808]
We find that state-of-the-art NLI models, such as RoBERTa and BART, are invariant to, and sometimes even perform better on, examples with randomly reordered words.
Our findings call into question the idea that our natural language understanding models, and the tasks used for measuring their progress, genuinely require a human-like understanding of syntax.
arXiv Detail & Related papers (2020-12-30T20:40:48Z)
- SLM: Learning a Discourse Language Representation with Sentence Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this feature of our model improves the performance of the original BERT by large margins.
arXiv Detail & Related papers (2020-10-30T13:33:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.