On the class of coding optimality of human languages and the origins of Zipf's law
- URL: http://arxiv.org/abs/2505.20015v5
- Date: Tue, 02 Sep 2025 18:22:53 GMT
- Title: On the class of coding optimality of human languages and the origins of Zipf's law
- Authors: Ramon Ferrer-i-Cancho
- Abstract summary: We present a new class of optimality for coding systems. Within that class, Zipf's law, the size-rank law and the size-probability law form a group-like structure. All languages showing sufficient agreement with Zipf's law are potential members of the class.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Here we present a new class of optimality for coding systems. Members of that class are displaced linearly from optimal coding and thus exhibit Zipf's law, namely a power-law distribution of frequency ranks. Within that class, Zipf's law, the size-rank law and the size-probability law form a group-like structure. We identify human languages that are members of the class. All languages showing sufficient agreement with Zipf's law are potential members of the class. In contrast, the communication systems of some other species cannot be members of that class because they exhibit an exponential distribution instead, although those of dolphins and humpback whales might be. We provide a new insight into plots of frequency versus rank in double logarithmic scale. For any system, a straight line in that scale indicates that the lengths of optimal codes under non-singular coding and under uniquely decodable encoding are displaced by a linear function whose slope is the exponent of Zipf's law. For systems under compression and constrained to be uniquely decodable, such a straight line may indicate that the system is coding close to optimality. We provide support for the hypothesis that Zipf's law originates from compression and define testable conditions for the emergence of Zipf's law in compressing systems.
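The abstract's reading of frequency-versus-rank plots in double logarithmic scale can be illustrated with a minimal sketch. It assumes the exponent is estimated as the sign-flipped least-squares slope of log frequency against log rank, which is a common but crude estimator and not necessarily the method used in the paper. The function names `rank_frequency` and `zipf_exponent` are hypothetical and chosen only for this illustration.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return type frequencies in decreasing order, so index 0 is rank 1."""
    return sorted(Counter(tokens).values(), reverse=True)


def zipf_exponent(frequencies):
    """Estimate the Zipf exponent from a rank-ordered list of frequencies.

    Assumption: ordinary least squares on (log rank, log frequency).
    A straight line in that scale has slope -alpha, so the slope is
    returned with its sign flipped. Needs at least two positive values.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Synthetic frequencies that follow f(r) = 1000 / r exactly (exponent 1).
synthetic = [1000.0 / rank for rank in range(1, 51)]
estimated_alpha = zipf_exponent(synthetic)
```

On real data the fit is sensitive to the low-frequency tail, so the estimate should be read as a rough diagnostic of how straight the plot is rather than as a precise exponent.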
Related papers
- Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA [50.494504099850325]
We introduce the Geodesic Hypothesis, positing that token sequences trace geodesics on a smooth semantic manifold and are therefore locally linear. We show this constraint improves signal-to-noise ratio, and preserves diversity by preventing collisions during trajectory. We demonstrate that geometric priors can surpass brute-force scaling.
arXiv Detail & Related papers (2026-02-26T04:45:07Z) - Scaling Laws for Code: Every Programming Language Matters [73.6302896079007]
Code large language models (Code LLMs) are powerful but costly to train. Different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance. We present the first systematic exploration of scaling laws for multilingual code pre-training.
arXiv Detail & Related papers (2025-12-15T16:07:34Z) - Zipf Distributions from Two-Stage Symbolic Processes: Stability Under Stochastic Lexical Filtering [0.0]
Zipf's law in language lacks a definitive origin, debated across fields. This study explains Zipf-like behavior using geometric mechanisms without linguistic elements.
arXiv Detail & Related papers (2025-11-26T04:59:40Z) - Is Compression Really Linear with Code Intelligence? [60.123628177110206]
Format Annealing is a lightweight, transparent training methodology designed to assess the intrinsic capabilities of pre-trained models equitably. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and bits-per-character (BPC). Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework in the code domain.
arXiv Detail & Related papers (2025-05-16T16:59:14Z) - Letting the tiger out of its cage: bosonic coding without concatenation [3.2055955766884465]
Cat codes are encodings into a single photonic or phononic mode that offer a promising avenue for hardware-efficient fault-tolerant quantum computation.
We construct multimode codes with similar linear constraints using any two integer matrices satisfying the homological condition of a quantum rotor code.
Just like the pair-cat code, syndrome extraction can be performed in tandem with stabilizing dissipation using current superconducting-circuit designs.
arXiv Detail & Related papers (2024-11-14T18:38:33Z) - Equivalence Classes of Quantum Error-Correcting Codes [49.436750507696225]
Quantum error-correcting codes (QECC's) are needed to combat the inherent noise affecting quantum processes.
We represent QECC's in a form called a ZX diagram, consisting of a tensor network.
arXiv Detail & Related papers (2024-06-17T20:48:43Z) - Factor Graph Optimization of Error-Correcting Codes for Belief Propagation Decoding [62.25533750469467]
Low-Density Parity-Check (LDPC) codes possess several advantages over other families of codes.
The proposed approach is shown to outperform the decoding performance of existing popular codes by orders of magnitude.
arXiv Detail & Related papers (2024-06-09T12:08:56Z) - Improving Deep Representation Learning via Auxiliary Learnable Target Coding [69.79343510578877]
This paper introduces a novel learnable target coding as an auxiliary regularization of deep representation learning.
Specifically, a margin-based triplet loss and a correlation consistency loss on the proposed target codes are designed to encourage more discriminative representations.
arXiv Detail & Related papers (2023-05-30T01:38:54Z) - Gaussian conversion protocol for heralded generation of qunaught states [66.81715281131143]
Bosonic codes map qubit-type quantum information onto the larger bosonic Hilbert space.
We convert between two instances of these codes, GKP qunaught states and four-fold-symmetric binomial states corresponding to a zero-logical encoded qubit.
We obtain GKP qunaught states with a fidelity of over 98% and a probability of approximately 3.14%.
arXiv Detail & Related papers (2023-01-24T14:17:07Z) - Local Grammar-Based Coding Revisited [0.0]
In minimal local grammar-based coding, the input string is represented as a grammar with the minimal output length defined. We invoke a simple harmonic bound on ranked probabilities, which is reminiscent of Zipf's law. We refine known bounds on the vocabulary size, showing its partial power-law equivalence with mutual information and redundancy. We analyze grammar-based codes with finite vocabularies being empirical rank lists, proving that such codes are also universal.
arXiv Detail & Related papers (2022-09-27T19:05:22Z) - Simple Genetic Operators are Universal Approximators of Probability Distributions (and other Advantages of Expressive Encodings) [27.185579156106694]
This paper characterizes the inherent power of evolutionary algorithms.
It shows that expressive encodings can be a key to understanding and realizing the full power of evolution.
arXiv Detail & Related papers (2022-02-19T20:54:37Z) - Dense Coding with Locality Restriction for Decoder: Quantum Encoders vs. Super-Quantum Encoders [67.12391801199688]
We investigate dense coding by imposing various locality restrictions to our decoder.
In this task, the sender Alice and the receiver Bob share an entangled state.
arXiv Detail & Related papers (2021-09-26T07:29:54Z) - Learned transform compression with optimized entropy encoding [72.20409648915398]
We consider the problem of learned transform compression where we learn both the transform and the probability distribution over the discrete codes.
We employ a soft relaxation of the quantization operation to allow for back-propagation of gradients and employ vector (rather than scalar) quantization of the latent codes.
arXiv Detail & Related papers (2021-04-07T17:58:01Z) - Approximate Bacon-Shor Code and Holography [0.0]
We explicitly construct a class of holographic quantum error correction codes with non-algebra centers.
We use the Bacon-Shor codes and perfect tensors to construct a gauge code (or a stabilizer code with gauge-fixing).
We then construct approximate versions of the holographic hybrid codes by "skewing" the code subspace.
arXiv Detail & Related papers (2020-10-12T18:39:09Z) - The empirical structure of word frequency distributions [0.0]
I show that first names form natural communicative distributions in most languages.
I then show this pattern of findings replicates in communicative distributions of English nouns and verbs.
arXiv Detail & Related papers (2020-01-09T20:52:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.