Local Grammar-Based Coding Revisited
- URL: http://arxiv.org/abs/2209.13636v3
- Date: Wed, 16 Apr 2025 08:23:43 GMT
- Title: Local Grammar-Based Coding Revisited
- Authors: Łukasz Dębowski
- Abstract summary: In minimal local grammar-based coding, the input string is represented as a grammar with the minimal output length, defined via simple symbol-by-symbol encoding. We invoke a simple harmonic bound on ranked probabilities, which is reminiscent of Zipf's law. We refine known bounds on the vocabulary size, showing its partial power-law equivalence with mutual information and redundancy. We analyze grammar-based codes with finite vocabularies being empirical rank lists, proving that such codes are also universal.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the setting of minimal local grammar-based coding, the input string is represented as a grammar with the minimal output length defined via simple symbol-by-symbol encoding. This paper discusses four contributions to this field. First, we invoke a simple harmonic bound on ranked probabilities, which is reminiscent of Zipf's law and simplifies universality proofs for minimal local grammar-based codes. Second, we refine known bounds on the vocabulary size, showing its partial power-law equivalence with mutual information and redundancy. These bounds are relevant for linking Zipf's law with the neural scaling law for large language models. Third, we develop a framework for universal codes with fixed infinite vocabularies, recasting universal coding as matching ranked patterns that are independent of empirical data. Finally, we analyze grammar-based codes whose finite vocabularies are empirical rank lists, proving that such codes are also universal. These results extend the foundations of universal grammar-based coding and reaffirm previously stated connections to power laws for human language and language models.
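The harmonic bound admits a one-line justification: if probabilities are sorted in descending order, the top k each weigh at least p_k and jointly sum to at most one, so p_k <= 1/k, a Zipf-like decay. A minimal Python sketch (illustrative, not code from the paper):

```python
from collections import Counter

def ranked_probabilities(symbols):
    """Empirical probabilities of symbols, sorted in descending order."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return sorted((c / total for c in counts.values()), reverse=True)

text = "abracadabra alakazam abracadabra"
probs = ranked_probabilities(text)

# Harmonic bound: p_k <= 1/k for every rank k (1-indexed),
# because k * p_k <= p_1 + ... + p_k <= 1.
for k, p in enumerate(probs, start=1):
    assert p <= 1 / k
    print(f"rank {k}: p = {p:.4f} <= 1/{k} = {1/k:.4f}")
```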
Related papers
- "Newspaper Eat" Means "Not Tasty": A Taxonomy and Benchmark for Coded Languages in Real-World Chinese Online Reviews [14.361987265496042]
This paper introduces CodedLang, a dataset of 7,744 Chinese Google Maps reviews with span-level annotations of coded language. We developed a seven-class taxonomy that captures common encoding strategies, including phonetic, orthographic, and cross-lingual substitutions. Results show that even strong models can fail to identify or understand coded language.
arXiv Detail & Related papers (2026-01-12T08:03:37Z) - TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar [8.34539885321864]
We show that semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. We introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation.
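A minimal sketch of the phenomenon, assuming a toy greedy longest-match subword tokenizer with an invented vocabulary (not TokDrift's actual setup): an extra space in otherwise identical code changes the subword sequence.

```python
# Toy greedy longest-match subword tokenizer; the vocabulary is
# invented for illustration and is not TokDrift's actual setup.
VOCAB = {"def", "de", "f", " f", "oo", "o", "(", ")", ":", " ", "fo", "foo", " foo"}

def tokenize(text, vocab=VOCAB, max_len=4):
    tokens, i = [], 0
    while i < len(text):
        # Take the longest vocabulary entry matching at position i.
        for l in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + l]
            if piece in vocab:
                tokens.append(piece)
                i += l
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

# Semantically identical headers, different whitespace,
# different subword sequences:
print(tokenize("def foo():"))   # ['def', ' foo', '(', ')', ':']
print(tokenize("def  foo():"))  # ['def', ' ', ' foo', '(', ')', ':']
```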
arXiv Detail & Related papers (2025-10-16T17:59:45Z) - Read it in Two Steps: Translating Extremely Low-Resource Languages with Code-Augmented Grammar Books [30.608065466078227]
This paper investigates the role of grammar books in translating extremely low-resource languages by decomposing translation into two key steps: grammar rule retrieval and application. Our experiments show that using code rules significantly boosts both rule retrieval and application, ultimately resulting in a 13.1% BLEU improvement in translation.
arXiv Detail & Related papers (2025-06-02T15:36:37Z) - CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction [47.17755403213469]
We propose CodeI/O, a novel approach that condenses diverse reasoning patterns embedded in contextually-grounded codes.
By training models to predict inputs/outputs given code and test cases entirely in natural language, we expose them to universal reasoning primitives.
Experimental results demonstrate CodeI/O leads to consistent improvements across symbolic, scientific, logic, math & numerical, and commonsense reasoning tasks.
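A hedged sketch of the data-construction idea: a function plus one test case yields an output-prediction and an input-prediction example phrased in natural language. The prompt template below is an assumption for illustration, not the paper's exact format.

```python
# Hedged sketch of turning code + a test case into the two
# CodeI/O-style prediction tasks; the prompt template is an
# assumption for illustration, not the paper's exact format.
code = """
def clamp(x, lo, hi):
    return max(lo, min(x, hi))
"""
test_input = {"x": 17, "lo": 0, "hi": 10}
expected_output = 10

output_prediction = (
    f"Given the code:\n{code}\n"
    f"and the input {test_input}, predict the output in natural language."
)  # target: "The function returns 10."

input_prediction = (
    f"Given the code:\n{code}\n"
    f"and the output {expected_output}, predict a feasible input."
)  # target: "One feasible input is x=17, lo=0, hi=10."

print(output_prediction)
print(input_prediction)
```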
arXiv Detail & Related papers (2025-02-11T07:26:50Z) - Training Neural Networks as Recognizers of Formal Languages [87.06906286950438]
We train and evaluate neural networks directly as binary classifiers of strings.
We provide results on a variety of languages across the Chomsky hierarchy for three neural architectures.
Our contributions will facilitate theoretically sound empirical testing of language recognition claims in future work.
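As an illustrative sketch of the task setup (the specific language a^n b^n is my choice, not necessarily one the paper uses): generate labeled strings for binary classification.

```python
import random

def is_anbn(s):
    """Membership test for the context-free language a^n b^n (n >= 1)."""
    n = len(s) // 2
    return len(s) >= 2 and len(s) % 2 == 0 and s == "a" * n + "b" * n

def sample_dataset(num_examples, max_len=20, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(num_examples):
        if rng.random() < 0.5:                      # positive example
            n = rng.randint(1, max_len // 2)
            s = "a" * n + "b" * n
        else:                                       # random string, usually negative
            s = "".join(rng.choice("ab") for _ in range(rng.randint(1, max_len)))
        data.append((s, is_anbn(s)))
    return data

for s, label in sample_dataset(5):
    print(f"{s!r}: {label}")
```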
arXiv Detail & Related papers (2024-11-11T16:33:25Z) - Incremental Context-free Grammar Inference in Black Box Settings [17.601446198181048]
Black-box context-free grammar inference is a significant challenge in many practical settings.
We propose a novel method that segments example strings into smaller units and incrementally infers the grammar.
Our approach, named Kedavra, has demonstrated superior grammar quality (enhanced precision and recall), faster runtime, and improved readability through empirical comparison.
arXiv Detail & Related papers (2024-08-29T17:00:38Z) - Markov Constraint as Large Language Model Surrogate [49.86129209397701]
NgramMarkov is a constraint dedicated to text generation in constraint programming (CP).
It limits the product of the probabilities of the n-grams of a sentence.
A real-world problem has been solved for the first time using 4-grams instead of 5-grams.
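A minimal sketch of the constraint, with made-up bigram probabilities: score a sentence by the product of its n-gram probabilities in log space and require it to stay above a threshold.

```python
import math

# Made-up bigram probabilities for illustration; a real model
# would be estimated from a corpus.
BIGRAM_P = {
    ("the", "cat"): 0.20, ("cat", "sat"): 0.10,
    ("sat", "down"): 0.30, ("the", "dog"): 0.15,
}

def log_ngram_score(words, n=2, probs=BIGRAM_P, floor=1e-6):
    """Log of the product of n-gram probabilities of the sentence."""
    score = 0.0
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        score += math.log(probs.get(gram, floor))
    return score

def satisfies_markov_constraint(words, threshold_log_p):
    # The NgramMarkov-style constraint: the n-gram probability
    # product must stay above a threshold.
    return log_ngram_score(words) >= threshold_log_p

print(satisfies_markov_constraint(["the", "cat", "sat", "down"],
                                  threshold_log_p=math.log(1e-4)))  # True
```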
arXiv Detail & Related papers (2024-06-11T16:09:53Z) - Grammar-Aligned Decoding [30.972850034752884]
Large Language Models (LLMs) struggle with reliably generating highly structured outputs, such as program code, mathematical formulas, or well-formed markup.
Grammar-constrained decoding (GCD) approaches mitigate this problem by greedily restricting the tokens an LLM can output at each step, guaranteeing that the output matches a given constraint.
In this paper, we demonstrate that GCD techniques can distort the LLM's distribution, leading to outputs that are grammatical but whose likelihoods are not proportional to those given by the LLM.
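A toy two-token example of the distortion (my construction, not the paper's): with greedy per-step masking and renormalization, the probabilities assigned to grammatical outputs are not proportional to the LLM's own.

```python
# Toy LLM: P(first token) and P(second | first) over vocabulary {a, b}.
P1 = {"a": 0.9, "b": 0.1}
P2 = {"a": {"a": 0.8, "b": 0.2}, "b": {"a": 0.5, "b": 0.5}}
GRAMMATICAL = {"ab", "ba", "bb"}  # "aa" violates the toy grammar

def llm_conditioned(strings):
    """LLM distribution restricted to grammatical strings (the ideal target)."""
    p = {s: P1[s[0]] * P2[s[0]][s[1]] for s in strings}
    z = sum(p.values())
    return {s: v / z for s, v in p.items()}

def gcd(strings):
    """Greedy constrained decoding: mask ungrammatical tokens per step, renormalize."""
    out = {}
    for s in strings:
        # Step 1: both first tokens extend to a grammatical string, so no mask.
        p = P1[s[0]]
        # Step 2: renormalize over tokens that keep the string grammatical.
        allowed = {t: P2[s[0]][t] for t in "ab" if s[0] + t in GRAMMATICAL}
        p *= allowed[s[1]] / sum(allowed.values())
        out[s] = p
    return out

print("ideal:", llm_conditioned(GRAMMATICAL))  # ab~0.64, ba~0.18, bb~0.18
print("GCD:  ", gcd(GRAMMATICAL))              # ab=0.90, ba=0.05, bb=0.05
```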
arXiv Detail & Related papers (2024-05-31T17:39:15Z) - Language Generation with Strictly Proper Scoring Rules [70.340673452404]
We propose a strategy for adapting scoring rules to language generation, allowing for language modeling with any non-local scoring rules.
We train language generation models using two classic strictly proper scoring rules, the Brier score and the Spherical score, as alternatives to the logarithmic score.
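For concreteness (a sketch, not the paper's training code), the two losses for a predicted distribution p and true class y: the Brier score sums (p_i - 1[i=y])^2 over classes, and the negative spherical score is -p_y / ||p||_2; both are strictly proper.

```python
import math

def brier_loss(p, y):
    """Brier score: sum_i (p_i - 1[i == y])^2; lower is better."""
    return sum((pi - (1.0 if i == y else 0.0)) ** 2 for i, pi in enumerate(p))

def spherical_loss(p, y):
    """Negative spherical score: -p_y / ||p||_2; lower is better."""
    norm = math.sqrt(sum(pi * pi for pi in p))
    return -p[y] / norm

# A confident, correct prediction scores better than a uniform one.
p_confident, p_uniform, y = [0.8, 0.1, 0.1], [1/3, 1/3, 1/3], 0
print(brier_loss(p_confident, y), brier_loss(p_uniform, y))          # 0.06 < 0.667
print(spherical_loss(p_confident, y), spherical_loss(p_uniform, y))  # -0.98 < -0.577
```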
arXiv Detail & Related papers (2024-05-29T09:09:00Z) - Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z) - Sparse Logistic Regression with High-order Features for Automatic Grammar Rule Extraction from Treebanks [6.390468088226495]
We propose a new method to extract and explore significant fine-grained grammar patterns from treebanks.
We extract descriptions and rules across different languages for two linguistic phenomena, agreement and word order.
Our method captures both well-known and less well-known significant grammar rules in Spanish, French, and Wolof.
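A hedged sketch of the extraction idea, with invented features and data: fit an L1-penalized logistic regression over binary treebank features and read significant patterns off the nonzero coefficients.

```python
# Hedged sketch: L1-penalized logistic regression over binary treebank
# features; nonzero coefficients flag significant patterns. The feature
# names and data below are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

features = ["subj_is_plural", "verb_is_finite", "obj_precedes_verb"]
# Rows: binary feature vectors; y: whether agreement is marked.
X = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 1],
              [1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 0]])
y = np.array([1, 1, 1, 0, 0, 1, 1, 0])

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)

# The sparsity pattern is the "extracted rule": features with
# nonzero weight are the ones the phenomenon depends on.
for name, w in zip(features, clf.coef_[0]):
    if abs(w) > 1e-6:
        print(f"{name}: weight {w:+.2f}")
```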
arXiv Detail & Related papers (2024-03-26T09:39:53Z) - Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search [7.822427053078387]
The Generation-Augmented Retrieval (GAR) framework generates exemplar code snippets to augment queries.
We propose a simple yet effective method that additionally Rewrites the Code (ReCo) within the codebase for style normalization.
Code Style Similarity is the first metric tailored to quantify stylistic similarities in code.
arXiv Detail & Related papers (2024-01-09T12:12:50Z) - Addressing Stopping Failures for Small Set Flip Decoding of Hypergraph Product Codes [1.04049929128816]
Hypergraph product codes are a promising family of constant-rate quantum LDPC codes.
Small-Set-Flip ($\texttt{SSF}$) is a linear-time decoding algorithm.
We propose a new decoding algorithm called the Projection-Along-a-Line ($\texttt{PAL}$) decoder to supplement $\texttt{SSF}$ after stopping failures.
arXiv Detail & Related papers (2023-11-01T22:08:49Z) - Fast Deterministic Black-box Context-free Grammar Inference [7.637155559284357]
The state-of-the-art approach generalizes grammar rules starting from flat parse trees.
We observe that many of Arvada's generalizations violate common language concept nesting rules.
The resulting TreeVada yielded faster and higher-quality grammars in an empirical comparison.
arXiv Detail & Related papers (2023-08-11T14:45:26Z) - Efficient Semiring-Weighted Earley Parsing [71.77514972816347]
This paper provides a reference description, in the form of a deduction system, of Earley's (1970) context-free parsing algorithm with various speed-ups.
Our presentation includes a known worst-case runtime improvement from Earley's $O(N^3|G||R|)$, which is unworkable for the large grammars that arise in natural language processing, to $O(N^3|G|)$, which matches the runtime of CKY on a binarized version of the grammar.
We treat the generalization to semiring-weighted deduction, preprocessing the grammar like Stolcke (1995) to eliminate deduction cycles, and further generalize Stolcke's method to compute the weights of sentence prefixes.
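For orientation, here is a minimal unweighted Earley recognizer with the classic predict/scan/complete steps (my sketch; the paper's deduction-system presentation, semiring weights, and speed-ups are not implemented here).

```python
# Minimal unweighted Earley recognizer; an item is (lhs, rhs, dot, origin).
GRAMMAR = {"S": [("S", "+", "S"), ("n",)]}   # S -> S '+' S | 'n'
START = "S"

def earley_recognize(words):
    chart = [set() for _ in range(len(words) + 1)]
    chart[0] = {(START, rhs, 0, 0) for rhs in GRAMMAR[START]}
    for i in range(len(words) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in GRAMMAR:                        # predict
                    for prod in GRAMMAR[sym]:
                        item = (sym, prod, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item)
                            agenda.append(item)
                elif i < len(words) and words[i] == sym:  # scan
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                         # complete
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        item = (l2, r2, d2 + 1, o2)
                        if item not in chart[i]:
                            chart[i].add(item)
                            agenda.append(item)
    return any((START, rhs, len(rhs), 0) in chart[len(words)]
               for rhs in GRAMMAR[START])

print(earley_recognize(["n", "+", "n"]))  # True
print(earley_recognize(["n", "+"]))       # False
```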
arXiv Detail & Related papers (2023-07-06T13:33:59Z) - Quantum computation on a 19-qubit wide 2d nearest neighbour qubit array [59.24209911146749]
This paper explores the relationship between the width of a qubit lattice constrained in one dimension and physical thresholds.
We engineer an error bias at the lowest level of encoding using the surface code.
We then address this bias at a higher level of encoding using a lattice-surgery surface code bus.
arXiv Detail & Related papers (2022-12-03T06:16:07Z) - Incorporating Constituent Syntax for Coreference Resolution [50.71868417008133]
We propose a graph-based method to incorporate constituent syntactic structures.
We also explore utilising higher-order neighbourhood information to encode the rich structures in constituent trees.
Experiments on the English and Chinese portions of OntoNotes 5.0 benchmark show that our proposed model either beats a strong baseline or achieves new state-of-the-art performance.
arXiv Detail & Related papers (2022-02-22T07:40:42Z) - Grammar-Based Grounded Lexicon Learning [68.59500589319023]
G2L2 is a lexicalist approach toward learning a compositional and grounded meaning representation of language.
At the core of G2L2 is a collection of lexicon entries, which map each word to a syntactic type and a neuro-symbolic semantic program.
G2L2 can generalize from small amounts of data to novel compositions of words.
arXiv Detail & Related papers (2022-02-17T18:19:53Z) - Learning grammar with a divide-and-concur neural network [4.111899441919164]
We implement a divide-and-concur iterative projection approach to context-free grammar inference.
Our method requires a relatively small number of discrete parameters, making the inferred grammar directly interpretable.
arXiv Detail & Related papers (2022-01-18T22:42:43Z) - Rule Augmented Unsupervised Constituency Parsing [11.775897250472116]
We propose an approach that utilizes very generic linguistic knowledge of the language present in the form of syntactic rules.
We achieve new state-of-the-art results on two benchmark datasets, MNLI and WSJ.
arXiv Detail & Related papers (2021-05-21T08:06:11Z) - How (Non-)Optimal is the Lexicon? [35.91590073820011]
We take a coding-theoretic view of the lexicon and make use of a novel generative statistical model.
Examining corpora from 7 typologically diverse languages, we use those upper bounds to quantify the lexicon's optimality.
We find that (compositional) morphology and graphotactics can sufficiently account for most of the complexity of natural codes.
arXiv Detail & Related papers (2021-04-29T11:55:47Z) - Autoregressive Belief Propagation for Decoding Block Codes [113.38181979662288]
We revisit recent methods that employ graph neural networks for decoding error correcting codes.
Our method violates the symmetry conditions that enable the other methods to train exclusively with the zero-word.
Despite not having the luxury of training on a single word, and the inability to train on more than a small fraction of the relevant sample space, we demonstrate effective training.
arXiv Detail & Related papers (2021-01-23T17:14:55Z) - CodeBLEU: a Method for Automatic Evaluation of Code Synthesis [57.87741831987889]
In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy.
We introduce a new automatic evaluation metric, dubbed CodeBLEU.
It absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow.
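A simplified sketch of the combination (illustrative weights, not the official implementation): mix token n-gram precision with an AST node-type match computed via Python's ast module; real CodeBLEU additionally uses keyword-weighted n-grams and data-flow matching.

```python
# Simplified CodeBLEU-style score combining token n-gram precision
# with an AST node-type match; the weights are illustrative, and the
# real metric also uses keyword-weighted n-grams and data-flow matching.
import ast
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap(cand, ref, n):
    """Clipped n-gram precision, as in BLEU's matching step."""
    c, r = ngrams(cand, n), ngrams(ref, n)
    matched = sum(min(cnt, r[g]) for g, cnt in c.items())
    return matched / max(1, sum(c.values()))

def ast_node_types(code):
    """Flatten a program into the multiset of its AST node types."""
    return [type(node).__name__ for node in ast.walk(ast.parse(code))]

def simple_codebleu(cand_code, ref_code, alpha=0.5):
    token_score = overlap(cand_code.split(), ref_code.split(), n=2)
    syntax_score = overlap(ast_node_types(cand_code), ast_node_types(ref_code), n=1)
    return alpha * token_score + (1 - alpha) * syntax_score

ref = "def add(a, b):\n    return a + b"
cand = "def add(x, y):\n    return x + y"  # same structure, renamed identifiers
print(f"{simple_codebleu(cand, ref):.3f}")  # the AST term rewards shared syntax
```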
arXiv Detail & Related papers (2020-09-22T03:10:49Z)