Transducing Language Models
- URL: http://arxiv.org/abs/2603.05193v1
- Date: Thu, 05 Mar 2026 14:04:08 GMT
- Title: Transducing Language Models
- Authors: Vésteinn Snæbjarnarson, Samuel Kiegeland, Tianyu Liu, Reda Boumasmoud, Ryan Cotterell, Tim Vieira
- Abstract summary: We introduce a framework for language models derived from deterministic string-to-string transformations. We develop algorithms that compose a language model with an FST to *marginalize* over source strings mapping to a given target. We present an exact algorithm, an efficient approximation, and a theoretical analysis.
- Score: 52.080921891265255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern language models define distributions over strings, but downstream tasks often require different output formats. For instance, a model that generates byte-pair strings does not directly produce word-level predictions, and a DNA model does not directly produce amino-acid sequences. In such cases, a deterministic string-to-string transformation can convert the model's output to the desired form. This is a familiar pattern in probability theory: applying a function $f$ to a random variable $X\sim p$ yields a transformed random variable $f(X)$ with an induced distribution. While such transformations are occasionally used in language modeling, prior work does not treat them as yielding new, fully functional language models. We formalize this perspective and introduce a general framework for language models derived from deterministic string-to-string transformations. We focus on transformations representable as finite-state transducers -- a commonly used state-machine abstraction for efficient string-to-string mappings. We develop algorithms that compose a language model with an FST to *marginalize* over source strings mapping to a given target, propagating probabilities through the transducer without altering model parameters and enabling *conditioning* on transformed outputs. We present an exact algorithm, an efficient approximation, and a theoretical analysis. We conduct experiments in three domains: converting language models from tokens to bytes, from tokens to words, and from DNA to amino acids. These experiments demonstrate inference-time adaptation of pretrained language models to match application-specific output requirements.
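As a concrete illustration of the pushforward construction described in the abstract, the following minimal Python sketch computes the induced distribution by brute-force enumeration over a toy source distribution. This is not the paper's FST-composition algorithm; the toy distribution `p_source` and the detokenization map `f` are illustrative assumptions.

```python
from collections import defaultdict

# Toy source-side language model: a distribution over token sequences.
# Distinct tokenizations ("ab",) and ("a", "b") detokenize to the same
# byte string "ab", so their probabilities must be pooled.
p_source = {
    ("ab",): 0.5,
    ("a", "b"): 0.2,
    ("a", "c"): 0.3,
}

def f(token_seq):
    """Deterministic string-to-string transformation: detokenize to a string."""
    return "".join(token_seq)

# Induced (pushforward) distribution: p_f(y) = sum of p(x) over all x with f(x) = y.
p_target = defaultdict(float)
for x, px in p_source.items():
    p_target[f(x)] += px
print(dict(p_target))  # {'ab': 0.7, 'ac': 0.3}

# Conditioning on a transformed output y restricts p to the preimage of y
# and renormalizes: p(x | f(x) = y) = p(x) / p_f(y).
y = "ab"
posterior = {x: px / p_target[y] for x, px in p_source.items() if f(x) == y}
print(posterior)  # {('ab',): 0.714..., ('a', 'b'): 0.285...}
```

The paper's contribution is doing this marginalization efficiently through the transducer rather than by enumerating source strings, which is intractable for real vocabularies; the sketch only shows what quantity is being computed.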
Related papers
- Understanding LLM Failures: A Multi-Tape Turing Machine Analysis of Systematic Errors in Language Model Reasoning [0.033842793760651545]
Large language models (LLMs) exhibit failure modes on seemingly trivial tasks.
We propose a formalisation of interaction using a deterministic multi-tape Turing machine.
The model enables precise localisation of failure modes to specific pipeline stages.
arXiv Detail & Related papers (2026-01-27T16:12:01Z)
- Sampling from Your Language Model One Byte at a Time [82.71473348639489]
Tokenization can introduce distortion into the model's generations, known as the Prompt Boundary Problem (PBP).
We present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM (a simplified sketch of this kind of marginalization appears after the related-papers list).
Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers.
arXiv Detail & Related papers (2025-06-17T02:37:04Z)
- Language Models over Canonical Byte-Pair Encodings [56.09166157337198]
We propose methods to enforce canonicality in token-level language models.
We show that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.
arXiv Detail & Related papers (2025-06-09T17:26:14Z)
- Algorithmic Capabilities of Random Transformers [49.73113518329544]
We investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized.
We find that these random transformers can perform a wide range of meaningful algorithmic tasks.
Our results indicate that some algorithmic capabilities are present in transformers even before these models are trained.
arXiv Detail & Related papers (2024-10-06T06:04:23Z)
- Understanding and Mitigating Tokenization Bias in Language Models [6.418593476658017]
State-of-the-art language models are autoregressive and operate on subword units known as tokens.
We show that popular encoding schemes induce a sampling bias that cannot be mitigated with more training or data.
We propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data.
arXiv Detail & Related papers (2024-06-24T17:38:02Z)
- Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity [16.893758238773263]
When primed with only a handful of training samples, very large pretrained language models such as GPT-3 have shown competitive results.
We demonstrate that the order in which the samples are provided can be the difference between near state-of-the-art and random guess performance.
We use the generative nature of the language models to construct an artificial development set and, based on entropy statistics of the candidate permutations from this set, we identify performant prompts.
arXiv Detail & Related papers (2021-04-18T09:29:16Z)
- Unnatural Language Inference [48.45003475966808]
We find that state-of-the-art NLI models, such as RoBERTa and BART, are invariant to, and sometimes even perform better on, examples with randomly reordered words.
Our findings call into question the idea that our natural language understanding models, and the tasks used for measuring their progress, genuinely require a human-like understanding of syntax.
arXiv Detail & Related papers (2020-12-30T20:40:48Z)
- Explicitly Modeling Syntax in Language Models with Incremental Parsing and a Dynamic Oracle [88.65264818967489]
We propose a new syntax-aware language model: Syntactic Ordered Memory (SOM)
The model explicitly models the structure with an incremental parser and maintains the conditional probability setting of a standard language model.
Experiments show that SOM can achieve strong results in language modeling, incremental parsing and syntactic generalization tests.
arXiv Detail & Related papers (2020-10-21T17:39:15Z)
- Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)
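Related to the byte-level sampling entry above, here is a minimal sketch of marginalizing a next-token distribution into a next-byte distribution. It assumes, for simplicity, that the context ends exactly on a token boundary, which sidesteps the prompt-boundary effects that the cited paper handles properly; the toy vocabulary, byte expansions, and probabilities are hypothetical.

```python
from collections import defaultdict

def next_byte_distribution(next_token_probs, token_bytes):
    """Pool next-token probabilities by the first byte of each token's expansion.

    next_token_probs: dict mapping token id -> probability of that token next.
    token_bytes: dict mapping token id -> the bytes that token expands to.

    Simplifying assumption: the context ends on a token boundary, so the next
    byte is the first byte of whichever token is generated next.
    """
    byte_probs = defaultdict(float)
    for tok, p in next_token_probs.items():
        expansion = token_bytes[tok]
        if expansion:  # skip zero-length entries such as special tokens
            byte_probs[expansion[0]] += p
    return dict(byte_probs)

# Hypothetical toy vocabulary and next-token distribution.
token_bytes = {0: b"the", 1: b"th", 2: b"a"}
next_token_probs = {0: 0.6, 1: 0.3, 2: 0.1}
print(next_byte_distribution(next_token_probs, token_bytes))
# {116: 0.9, 97: 0.1} (up to float rounding): P(next byte = b't') = 0.9, P(b'a') = 0.1
```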