The Statistical Signature of LLMs
- URL: http://arxiv.org/abs/2602.18152v1
- Date: Fri, 20 Feb 2026 11:33:37 GMT
- Title: The Statistical Signature of LLMs
- Authors: Ortal Hadad, Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Matteo Cinelli, Walter Quattrociocchi
- Abstract summary: We show that a simple, model-agnostic measure of statistical regularity differentiates generative regimes directly from surface text. Across settings, compression reveals a persistent structural signature of probabilistic generation. Our findings introduce a simple and robust framework for quantifying how generative systems reshape textual production.
- Score: 1.3135750017147134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models generate text through probabilistic sampling from high-dimensional distributions, yet how this process reshapes the structural statistical organization of language remains incompletely characterized. Here we show that lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text. We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs. Grokipedia), and fully synthetic social interaction environments (Moltbook vs. Reddit). Across settings, compression reveals a persistent structural signature of probabilistic generation. In controlled and mediated contexts, LLM-produced language exhibits higher structural regularity and compressibility than human-written text, consistent with a concentration of output within highly recurrent statistical patterns. However, this signature shows scale dependence: in fragmented interaction environments the separation attenuates, suggesting a fundamental limit to surface-level distinguishability at small scales. This compressibility-based separation emerges consistently across models, tasks, and domains and can be observed directly from surface text without relying on model internals or semantic evaluation. Overall, our findings introduce a simple and robust framework for quantifying how generative systems reshape textual production, offering a structural perspective on the evolving complexity of communication.
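To make the measure concrete, below is a minimal sketch of a compression-based regularity score. The abstract does not name a specific compressor, so zlib is an assumption here, and the two sample strings are placeholders for matched human and LLM corpora.

```python
import zlib

def compression_ratio(text: str, level: int = 9) -> float:
    """Compressed size over raw size; a lower ratio means the
    compressor found more recurrent surface structure."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, level)) / len(raw)

# In the paper's setting one would compare matched human and LLM
# corpora; the strings below are stand-ins to keep this runnable.
human_text = "Honestly, the weather turned so fast we just gave up and went home."
llm_text = ("It is important to note that the weather can change rapidly. "
            "It is important to plan accordingly and stay informed.")
print(f"human ratio: {compression_ratio(human_text):.3f}")
print(f"LLM ratio:   {compression_ratio(llm_text):.3f}")
```

A lower ratio indicates more recurrent statistical patterns, which is the direction the paper reports for LLM-produced text in controlled and mediated settings.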
Related papers
- Semantic Chunking and the Entropy of Natural Language [1.3592625530347717]
The entropy rate of printed English is famously estimated to be about one bit per character. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language.
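As a back-of-the-envelope companion to the one-bit-per-character figure, the sketch below computes the conditional entropy of a character n-gram model fit to the text itself; this plug-in estimate is biased low on repetitive text and is not the multi-scale model the paper proposes.

```python
import math
from collections import Counter, defaultdict

def ngram_entropy_rate(text: str, n: int = 3) -> float:
    """Bits per character under an n-gram model estimated on the
    text itself; a crude plug-in estimate of the entropy rate."""
    context_counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context_counts[text[i:i + n - 1]][text[i + n - 1]] += 1
    bits, total = 0.0, 0
    for counts in context_counts.values():
        ctx_total = sum(counts.values())
        for c in counts.values():
            bits -= c * math.log2(c / ctx_total)
            total += c
    return bits / total

# Repetitive text scores far below Shannon's ~1 bit/char estimate.
print(ngram_entropy_rate("the quick brown fox jumps over the lazy dog " * 50))
```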
arXiv Detail & Related papers (2026-02-13T18:58:10Z)
- SemaPop: Semantic-Persona Conditioned Population Synthesis [7.388951238297018]
This study proposes SemaPop, a semantic-statistical population synthesis model that integrates large language models (LLMs) with generative population modeling. The framework is instantiated with a Wasserstein GAN with gradient penalty (WGAN-GP) backbone, referred to as SemaPop-GAN.
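For orientation, the term that distinguishes a WGAN-GP backbone is the gradient penalty of Gulrajani et al.; the PyTorch sketch below shows that generic penalty, not SemaPop's actual architecture, and `critic` is an assumed module mapping feature vectors to scores.

```python
import torch

def gradient_penalty(critic, real, fake, lam: float = 10.0) -> torch.Tensor:
    """Standard WGAN-GP term: drive the critic's gradient norm
    toward 1 on random interpolates of real and fake samples."""
    eps = torch.rand(real.size(0), 1, device=real.device)
    interp = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True,
    )[0]
    return lam * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
```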
arXiv Detail & Related papers (2026-02-12T04:44:34Z)
- Improving LLM Reasoning with Homophily-aware Structural and Semantic Text-Attributed Graph Compression [55.51959317490934]
Large language models (LLMs) have demonstrated promising capabilities in Text-Attributed Graph (TAG) understanding. We argue that graphs inherently contain rich structural and semantic information, and that effectively exploiting it can unlock gains in LLM reasoning performance. We propose Homophily-aware Structural and Semantic Compression for LLMs (HS2C), a framework centered on exploiting graph homophily.
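The homophily statistic the framework builds on is simple to state: the fraction of edges whose endpoints share a label. A minimal sketch follows, using networkx as an assumed convenience; HS2C's structural and semantic compression machinery is of course more involved.

```python
import networkx as nx

def edge_homophily(graph: nx.Graph, labels: dict) -> float:
    """Fraction of edges whose two endpoints carry the same label."""
    edges = list(graph.edges())
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

g = nx.karate_club_graph()
labels = {node: data["club"] for node, data in g.nodes(data=True)}
print(f"edge homophily: {edge_homophily(g, labels):.2f}")
```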
arXiv Detail & Related papers (2026-01-13T03:35:18Z)
- Correlation Dimension of Auto-Regressive Large Language Models [11.183390901786659]
Large language models (LLMs) have achieved remarkable progress in natural language generation. They continue to display puzzling behaviors, such as repetition and incoherence, even when exhibiting low perplexity. We introduce correlation dimension, a fractal-geometric measure of self-similarity, to quantify the complexity of text.
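Correlation dimension is classically estimated with the Grassberger-Procaccia procedure: compute the fraction C(r) of point pairs closer than r and read the slope of log C(r) against log r. The sketch below runs it on a synthetic circle (intrinsic dimension 1); applying it to LLM text representations, as the paper does, is beyond this toy.

```python
import numpy as np

def correlation_dimension(points: np.ndarray, radii: np.ndarray) -> float:
    """Grassberger-Procaccia estimate: slope of log C(r) vs log r,
    where C(r) is the fraction of point pairs within distance r."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    pair_dists = dists[np.triu_indices(len(points), k=1)]
    corr = np.array([(pair_dists < r).mean() for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(corr), 1)
    return slope

# Sanity check: points on a circle should give dimension close to 1.
theta = np.random.uniform(0, 2 * np.pi, 1000)
circle = np.c_[np.cos(theta), np.sin(theta)]
print(correlation_dimension(circle, np.geomspace(0.05, 0.5, 10)))
```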
arXiv Detail & Related papers (2025-10-24T08:42:23Z)
- Probability Signature: Bridging Data Semantics and Embedding Structure in Language Models [8.87728727154868]
We propose a set of probability signatures that reflect the semantic relationships among tokens. We generalize our work to large language models (LLMs) by training the Qwen2.5 architecture on subsets of the Pile corpus.
arXiv Detail & Related papers (2025-09-24T13:49:44Z)
- Intrinsic Tensor Field Propagation in Large Language Models: A Novel Approach to Contextual Information Flow [0.0]
Intrinsic tensor field propagation improves contextual retention, dependency resolution, and inference across various linguistic structures. Experiments conducted on an open-source transformer-based model demonstrate that the approach provides measurable improvements on all three.
arXiv Detail & Related papers (2025-01-31T08:32:32Z)
- DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion [66.21290235237808]
We introduce an energy constrained diffusion model which encodes a batch of instances from a dataset into evolutionary states.
We provide rigorous theory that implies closed-form optimal estimates for the pairwise diffusion strength among arbitrary instance pairs.
Experiments highlight the wide applicability of our model as a general-purpose encoder backbone with superior performance in various tasks.
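For intuition only, one all-pairs diffusion update over a batch might look like the sketch below, where each state moves toward a similarity-weighted average of the others; the softmax weights are a stand-in assumption, not the closed-form diffusion strengths the paper derives.

```python
import numpy as np

def diffusion_step(z: np.ndarray, tau: float = 0.1) -> np.ndarray:
    """Move each instance state toward the similarity-weighted
    mean of all states (softmax weights as a stand-in)."""
    sim = z @ z.T / np.sqrt(z.shape[1])
    w = np.exp(sim - sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return z + tau * (w @ z - z)

z = np.random.randn(8, 16)     # batch of 8 instance states
for _ in range(5):
    z = diffusion_step(z)
```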
arXiv Detail & Related papers (2023-01-23T15:18:54Z)
- Simple Primitives with Feasibility- and Contextuality-Dependence for Open-World Compositional Zero-shot Learning [86.5258816031722]
The task of Compositional Zero-Shot Learning (CZSL) is to recognize images of novel state-object compositions that are absent during the training stage.
Previous methods of learning compositional embedding have shown effectiveness in closed-world CZSL.
In Open-World CZSL (OW-CZSL), their performance tends to degrade significantly due to the large cardinality of possible compositions.
arXiv Detail & Related papers (2022-11-05T12:57:06Z)
- Model Criticism for Long-Form Text Generation [113.13900836015122]
We apply a statistical tool, model criticism in latent space, to evaluate the high-level structure of generated text.
We perform experiments on three representative aspects of high-level discourse -- coherence, coreference, and topicality.
We find that transformer-based language models are able to capture topical structures but have a harder time maintaining structural coherence or modeling coreference.
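A toy rendering of the idea: fit a simple density to latent representations of human-written documents, then check how generated documents score under it. The diagonal Gaussian and the random latents below are assumptions for illustration; the paper's latent models of discourse structure are substantially richer.

```python
import numpy as np

def latent_discrepancy(real_latents: np.ndarray,
                       gen_latents: np.ndarray) -> float:
    """Average negative log-likelihood gap of generated latents
    under a diagonal Gaussian fit to the real latents."""
    mu = real_latents.mean(0)
    sigma = real_latents.std(0) + 1e-8

    def nll(x):
        return 0.5 * (((x - mu) / sigma) ** 2
                      + np.log(2 * np.pi * sigma ** 2)).sum(1).mean()

    return nll(gen_latents) - nll(real_latents)

real = np.random.randn(500, 32)          # stand-in document latents
gen = np.random.randn(500, 32) + 1.0     # shifted: structurally off
print(latent_discrepancy(real, gen))     # positive gap flags mismatch
```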
arXiv Detail & Related papers (2022-10-16T04:35:58Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
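One self-imitation round could plausibly look like the sketch below: sample candidate outputs, keep the best-scoring ones, and fold them back into the MLE training set. `model.generate` and `score_fn` are assumed interfaces, not the paper's exact formulation.

```python
import random

def self_augment(model, train_pairs, score_fn, k: int = 4, keep: float = 0.25):
    """Sample k candidates per source, keep the top-scoring
    fraction, and append them as extra (source, target) pairs."""
    augmented = list(train_pairs)
    for src, _ in train_pairs:
        candidates = [model.generate(src) for _ in range(k)]
        candidates.sort(key=lambda out: score_fn(src, out), reverse=True)
        augmented += [(src, out) for out in candidates[:max(1, int(k * keep))]]
    random.shuffle(augmented)
    return augmented
```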
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
- Multi-Fact Correction in Abstractive Text Summarization [98.27031108197944]
Span-Fact is a suite of two factual correction models that leverages knowledge learned from question answering models to make corrections in system-generated summaries via span selection.
Our models employ single or multi-masking strategies to either iteratively or auto-regressively replace entities in order to ensure semantic consistency w.r.t. the source text.
Experiments show that our models significantly boost the factual consistency of system-generated summaries without sacrificing summary quality in terms of both automatic metrics and human evaluation.
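The mask-and-refill idea can be caricatured with an off-the-shelf extractive QA pipeline: mask a suspect entity, query the source document, and substitute the extracted span. This is a toy stand-in, not Span-Fact's trained span-selection models, and the crude cloze-to-question conversion is an assumption.

```python
from transformers import pipeline

qa = pipeline("question-answering")  # downloads a default SQuAD model

def correct_entity(summary: str, entity: str, source: str) -> str:
    """Replace a (possibly wrong) entity in the summary with the
    span an extractive QA model pulls from the source."""
    question = summary.replace(entity, "what")  # crude cloze-to-question
    answer = qa(question=question, context=source)["answer"]
    return summary.replace(entity, answer)

source = "The accord was signed in Oslo in 1993 by both delegations."
summary = "The accord was signed in Geneva in 1993."
print(correct_entity(summary, "Geneva", source))  # ideally -> Oslo
```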
arXiv Detail & Related papers (2020-10-06T02:51:02Z)