The Statistical Signature of LLMs
- URL: http://arxiv.org/abs/2602.18152v1
- Date: Fri, 20 Feb 2026 11:33:37 GMT
- Title: The Statistical Signature of LLMs
- Authors: Ortal Hadad, Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Matteo Cinelli, Walter Quattrociocchi
- Abstract summary: We show that a simple, model-agnostic measure of statistical regularity differentiates generative regimes directly from surface text. Across settings, compression reveals a persistent structural signature of probabilistic generation. Our findings introduce a simple and robust framework for quantifying how generative systems reshape textual production.
- Score: 1.3135750017147134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models generate text through probabilistic sampling from high-dimensional distributions, yet how this process reshapes the structural statistical organization of language remains incompletely characterized. Here we show that lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text. We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs. Grokipedia), and fully synthetic social interaction environments (Moltbook vs. Reddit). Across settings, compression reveals a persistent structural signature of probabilistic generation. In controlled and mediated contexts, LLM-produced language exhibits higher structural regularity and compressibility than human-written text, consistent with a concentration of output within highly recurrent statistical patterns. However, this signature shows scale dependence: in fragmented interaction environments the separation attenuates, suggesting a fundamental limit to surface-level distinguishability at small scales. This compressibility-based separation emerges consistently across models, tasks, and domains and can be observed directly from surface text without relying on model internals or semantic evaluation. Overall, our findings introduce a simple and robust framework for quantifying how generative systems reshape textual production, offering a structural perspective on the evolving complexity of communication.
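To make the measure concrete, below is a minimal sketch of a compression-based regularity score. The abstract does not name a specific compressor, so zlib is an assumption here, and the two sample strings are placeholders for matched human and LLM corpora.

```python
import zlib

def compression_ratio(text: str, level: int = 9) -> float:
    """Compressed size over raw size; a lower ratio means the
    compressor found more recurrent surface structure."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, level)) / len(raw)

# In the paper's setting one would compare matched human and LLM
# corpora; the strings below are stand-ins to keep this runnable.
human_text = "Honestly, the weather turned so fast we just gave up and went home."
llm_text = ("It is important to note that the weather can change rapidly. "
            "It is important to plan accordingly and stay informed.")
print(f"human ratio: {compression_ratio(human_text):.3f}")
print(f"LLM ratio:   {compression_ratio(llm_text):.3f}")
```

A lower ratio indicates more recurrent statistical patterns, which is the direction the paper reports for LLM-produced text in controlled and mediated settings.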
Related papers
- Semantic Chunking and the Entropy of Natural Language [1.3592625530347717]
The entropy rate of printed English is famously estimated to be about one bit per character. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language.
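As a back-of-the-envelope companion to the one-bit-per-character figure, the sketch below computes the conditional entropy of a character n-gram model fit to the text itself; this plug-in estimate is biased low on repetitive text and is not the multi-scale model the paper proposes.

```python
import math
from collections import Counter, defaultdict

def ngram_entropy_rate(text: str, n: int = 3) -> float:
    """Bits per character under an n-gram model estimated on the
    text itself; a crude plug-in estimate of the entropy rate."""
    context_counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context_counts[text[i:i + n - 1]][text[i + n - 1]] += 1
    bits, total = 0.0, 0
    for counts in context_counts.values():
        ctx_total = sum(counts.values())
        for c in counts.values():
            bits -= c * math.log2(c / ctx_total)
            total += c
    return bits / total

# Repetitive text scores far below Shannon's ~1 bit/char estimate.
print(ngram_entropy_rate("the quick brown fox jumps over the lazy dog " * 50))
```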
arXiv Detail & Related papers (2026-02-13T18:58:10Z)
- SemaPop: Semantic-Persona Conditioned Population Synthesis [7.388951238297018]
This study proposes SemaPop, a semantic-statistical population synthesis model that integrates large language models (LLMs) with generative population modeling. The framework is instantiated with a Wasserstein GAN with gradient penalty (WGAN-GP) backbone, referred to as SemaPop-GAN.
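For orientation, the term that distinguishes a WGAN-GP backbone is the gradient penalty of Gulrajani et al.; the PyTorch sketch below shows that generic penalty, not SemaPop's actual architecture, and `critic` is an assumed module mapping feature vectors to scores.

```python
import torch

def gradient_penalty(critic, real, fake, lam: float = 10.0) -> torch.Tensor:
    """Standard WGAN-GP term: drive the critic's gradient norm
    toward 1 on random interpolates of real and fake samples."""
    eps = torch.rand(real.size(0), 1, device=real.device)
    interp = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True,
    )[0]
    return lam * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
```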
arXiv Detail & Related papers (2026-02-12T04:44:34Z)
- Improving LLM Reasoning with Homophily-aware Structural and Semantic Text-Attributed Graph Compression [55.51959317490934]
Large language models (LLMs) have demonstrated promising capabilities in Text-Attributed Graph (TAG) understanding. We argue that graphs inherently contain rich structural and semantic information, and that effectively exploiting it can unlock gains in LLM reasoning performance. We propose Homophily-aware Structural and Semantic Compression for LLMs (HS2C), a framework centered on exploiting graph homophily.
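The homophily statistic the framework builds on is simple to state: the fraction of edges whose endpoints share a label. A minimal sketch follows, using networkx as an assumed convenience; HS2C's structural and semantic compression machinery is of course more involved.

```python
import networkx as nx

def edge_homophily(graph: nx.Graph, labels: dict) -> float:
    """Fraction of edges whose two endpoints carry the same label."""
    edges = list(graph.edges())
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

g = nx.karate_club_graph()
labels = {node: data["club"] for node, data in g.nodes(data=True)}
print(f"edge homophily: {edge_homophily(g, labels):.2f}")
```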
arXiv Detail & Related papers (2026-01-13T03:35:18Z)
- Correlation Dimension of Auto-Regressive Large Language Models [11.183390901786659]
Large language models (LLMs) have achieved remarkable progress in natural language generation. They continue to display puzzling behaviors, such as repetition and incoherence, even when exhibiting low perplexity. We introduce correlation dimension, a fractal-geometric measure of self-similarity, to quantify the complexity of text.
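Correlation dimension is classically estimated with the Grassberger-Procaccia procedure: compute the fraction C(r) of point pairs closer than r and read the slope of log C(r) against log r. The sketch below runs it on a synthetic circle (intrinsic dimension 1); applying it to LLM text representations, as the paper does, is beyond this toy.

```python
import numpy as np

def correlation_dimension(points: np.ndarray, radii: np.ndarray) -> float:
    """Grassberger-Procaccia estimate: slope of log C(r) vs log r,
    where C(r) is the fraction of point pairs within distance r."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    pair_dists = dists[np.triu_indices(len(points), k=1)]
    corr = np.array([(pair_dists < r).mean() for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(corr), 1)
    return slope

# Sanity check: points on a circle should give dimension close to 1.
theta = np.random.uniform(0, 2 * np.pi, 1000)
circle = np.c_[np.cos(theta), np.sin(theta)]
print(correlation_dimension(circle, np.geomspace(0.05, 0.5, 10)))
```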
arXiv Detail & Related papers (2025-10-24T08:42:23Z)
- Probability Signature: Bridging Data Semantics and Embedding Structure in Language Models [8.87728727154868]
We propose a set of probability signatures that reflect the semantic relationships among tokens. We generalize our work to large language models (LLMs) by training the Qwen2.5 architecture on subsets of the Pile corpus.
arXiv Detail & Related papers (2025-09-24T13:49:44Z)
- Intrinsic Tensor Field Propagation in Large Language Models: A Novel Approach to Contextual Information Flow [0.0]
Intrinsic tensor field propagation improves contextual retention, dependency resolution, and inference across various linguistic structures. Experiments conducted on an open-source transformer-based model demonstrate that the approach provides measurable improvements on all three.
arXiv Detail & Related papers (2025-01-31T08:32:32Z)
- DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion [66.21290235237808]
We introduce an energy constrained diffusion model which encodes a batch of instances from a dataset into evolutionary states.
We provide rigorous theory that implies closed-form optimal estimates for the pairwise diffusion strength among arbitrary instance pairs.
Experiments highlight the wide applicability of our model as a general-purpose encoder backbone with superior performance in various tasks.
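For intuition only, one all-pairs diffusion update over a batch might look like the sketch below, where each state moves toward a similarity-weighted average of the others; the softmax weights are a stand-in assumption, not the closed-form diffusion strengths the paper derives.

```python
import numpy as np

def diffusion_step(z: np.ndarray, tau: float = 0.1) -> np.ndarray:
    """Move each instance state toward the similarity-weighted
    mean of all states (softmax weights as a stand-in)."""
    sim = z @ z.T / np.sqrt(z.shape[1])
    w = np.exp(sim - sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return z + tau * (w @ z - z)

z = np.random.randn(8, 16)     # batch of 8 instance states
for _ in range(5):
    z = diffusion_step(z)
```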
arXiv Detail & Related papers (2023-01-23T15:18:54Z)
- Simple Primitives with Feasibility- and Contextuality-Dependence for Open-World Compositional Zero-shot Learning [86.5258816031722]
The task of Compositional Zero-Shot Learning (CZSL) is to recognize images of novel state-object compositions that are absent during the training stage.
Previous methods of learning compositional embedding have shown effectiveness in closed-world CZSL.
In Open-World CZSL (OW-CZSL), their performance tends to degrade significantly due to the large cardinality of possible compositions.
arXiv Detail & Related papers (2022-11-05T12:57:06Z)
- Model Criticism for Long-Form Text Generation [113.13900836015122]
We apply a statistical tool, model criticism in latent space, to evaluate the high-level structure of generated text.
We perform experiments on three representative aspects of high-level discourse -- coherence, coreference, and topicality.
We find that transformer-based language models are able to capture topical structures but have a harder time maintaining structural coherence or modeling coreference.
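A toy rendering of the idea: fit a simple density to latent representations of human-written documents, then check how generated documents score under it. The diagonal Gaussian and the random latents below are assumptions for illustration; the paper's latent models of discourse structure are substantially richer.

```python
import numpy as np

def latent_discrepancy(real_latents: np.ndarray,
                       gen_latents: np.ndarray) -> float:
    """Average negative log-likelihood gap of generated latents
    under a diagonal Gaussian fit to the real latents."""
    mu = real_latents.mean(0)
    sigma = real_latents.std(0) + 1e-8

    def nll(x):
        return 0.5 * (((x - mu) / sigma) ** 2
                      + np.log(2 * np.pi * sigma ** 2)).sum(1).mean()

    return nll(gen_latents) - nll(real_latents)

real = np.random.randn(500, 32)          # stand-in document latents
gen = np.random.randn(500, 32) + 1.0     # shifted: structurally off
print(latent_discrepancy(real, gen))     # positive gap flags mismatch
```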
arXiv Detail & Related papers (2022-10-16T04:35:58Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
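One self-imitation round could plausibly look like the sketch below: sample candidate outputs, keep the best-scoring ones, and fold them back into the MLE training set. `model.generate` and `score_fn` are assumed interfaces, not the paper's exact formulation.

```python
import random

def self_augment(model, train_pairs, score_fn, k: int = 4, keep: float = 0.25):
    """Sample k candidates per source, keep the top-scoring
    fraction, and append them as extra (source, target) pairs."""
    augmented = list(train_pairs)
    for src, _ in train_pairs:
        candidates = [model.generate(src) for _ in range(k)]
        candidates.sort(key=lambda out: score_fn(src, out), reverse=True)
        augmented += [(src, out) for out in candidates[:max(1, int(k * keep))]]
    random.shuffle(augmented)
    return augmented
```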
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
- Multi-Fact Correction in Abstractive Text Summarization [98.27031108197944]
Span-Fact is a suite of two factual correction models that leverages knowledge learned from question answering models to make corrections in system-generated summaries via span selection.
Our models employ single or multi-masking strategies to either iteratively or auto-regressively replace entities in order to ensure semantic consistency w.r.t. the source text.
Experiments show that our models significantly boost the factual consistency of system-generated summaries without sacrificing summary quality in terms of both automatic metrics and human evaluation.
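The mask-and-refill idea can be caricatured with an off-the-shelf extractive QA pipeline: mask a suspect entity, query the source document, and substitute the extracted span. This is a toy stand-in, not Span-Fact's trained span-selection models, and the crude cloze-to-question conversion is an assumption.

```python
from transformers import pipeline

qa = pipeline("question-answering")  # downloads a default SQuAD model

def correct_entity(summary: str, entity: str, source: str) -> str:
    """Replace a (possibly wrong) entity in the summary with the
    span an extractive QA model pulls from the source."""
    question = summary.replace(entity, "what")  # crude cloze-to-question
    answer = qa(question=question, context=source)["answer"]
    return summary.replace(entity, answer)

source = "The accord was signed in Oslo in 1993 by both delegations."
summary = "The accord was signed in Geneva in 1993."
print(correct_entity(summary, "Geneva", source))  # ideally -> Oslo
```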
arXiv Detail & Related papers (2020-10-06T02:51:02Z)