On Characterizations for Language Generation: Interplay of Hallucinations, Breadth, and Stability
- URL: http://arxiv.org/abs/2412.18530v2
- Date: Thu, 03 Jul 2025 15:39:41 GMT
- Title: On Characterizations for Language Generation: Interplay of Hallucinations, Breadth, and Stability
- Authors: Alkis Kalavasis, Anay Mehrotra, Grigoris Velegkas
- Abstract summary: [KM24]'s main result is an algorithm for generating from any countable language collection in the limit. Recent work introduces different notions of breadth and explores when generation with breadth is possible. Our results show that generation with many existing notions of breadth becomes equally hard when stability is required.
- Score: 16.30681257128492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study language generation in the limit - introduced by Kleinberg and Mullainathan [KM24] - building on classical works of Gold [Gol67] and Angluin [Ang79]. [KM24]'s main result is an algorithm for generating from any countable language collection in the limit. While their algorithm eventually generates unseen strings from the target language $K$, it sacrifices coverage or breadth, i.e., its ability to generate a rich set of strings. Recent work introduces different notions of breadth and explores when generation with breadth is possible, leaving a full characterization of these notions open. Our first set of results settles this by characterizing generation for existing notions of breadth and their natural extensions. Interestingly, our lower bounds are very flexible and hold for many performance metrics beyond breadth - for instance, showing that, in general, it is impossible to train generators which achieve a higher perplexity or lower hallucination rate for $K$ compared to other languages. Next, we study language generation with breadth and stable generators - algorithms that eventually stop changing after seeing an arbitrary but finite number of strings - and prove unconditional lower bounds for such generators, strengthening the results of [KMV25] and demonstrating that generation with many existing notions of breadth becomes equally hard when stability is required. This gives a separation between stable and unstable generators for generation with approximate breadth, highlighting the rich interplay between breadth, stability, and consistency in language generation.
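As a rough illustration of the setting only - not the [KM24] algorithm or the constructions of this paper - the Python sketch below simulates generation in the limit for a toy nested collection $L_i = \{a^n : n \geq i\}$: an adversary enumerates an unknown target $K = L_k$, and after each observation the generator emits a string it has not yet seen. The consistent-guess strategy, the collection, and all identifiers are assumptions made for this example.

```python
from itertools import count

# Toy nested collection: L_i = { 'a' * n : n >= i }, so L_1 contains L_2 contains ...
def member(i: int, s: str) -> bool:
    """Membership oracle for L_i."""
    return set(s) <= {"a"} and len(s) >= i

def enumerate_language(i: int):
    """Enumerate L_i in a fixed order: a^i, a^(i+1), ..."""
    for n in count(i):
        yield "a" * n

def consistent_guess_generator(stream):
    """After each observed string from the unknown target K = L_k, guess the
    smallest language in the chain consistent with everything seen so far and
    emit an unseen string from it.  In this toy nested collection the guess
    eventually stops changing (stability) and every emitted string lies in K
    (no hallucination); the general case studied in the paper is far more
    subtle."""
    seen = set()
    for s in stream:
        seen.add(s)
        bound = min(len(t) for t in seen)      # no L_i with i > bound stays consistent
        guess = max(i for i in range(1, bound + 1)
                    if all(member(i, t) for t in seen))
        emit = next(t for t in enumerate_language(guess) if t not in seen)
        yield guess, emit

if __name__ == "__main__":
    # The adversary enumerates the target K = L_3 in some arbitrary order.
    stream = ["a" * 5, "a" * 3, "a" * 4, "a" * 7, "a" * 6]
    prev = None
    for step, (guess, emit) in enumerate(consistent_guess_generator(stream), 1):
        print(f"step {step}: guess L_{guess}, emit {emit!r}, guess changed: {guess != prev}")
        prev = guess
```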
Related papers
- Language Generation in the Limit: Noise, Loss, and Feedback [10.280148603465697]
We show that a finite union of uniformly generatable collections is generatable in the limit, and ask whether the same holds for non-uniform generation. We establish the equivalence of these models for uniform and non-uniform generation, and provide a characterization of non-uniform noisy generation.
arXiv Detail & Related papers (2025-07-21T07:18:04Z) - On Union-Closedness of Language Generation [48.36356615217017]
We investigate language generation in the limit - a model introduced by Kleinberg and Mullainathan and extended by Li, Raman, and Tewari. Our results resolve two open questions of Li et al. by proving that finite unions of generatable or non-uniformly generatable classes need not be generatable. Our approach uses carefully constructed classes together with a novel diagonalization argument that may be of independent interest in the growing area of language generation.
arXiv Detail & Related papers (2025-06-23T13:42:25Z) - Density Measures for Language Generation [2.2872032473279065]
We study the trade-off between validity and breadth of language generation algorithms.
Existing algorithms for language generation in the limit produce output sets that can have zero density in the true language.
In contrast, we provide an algorithm for language generation in the limit whose outputs have strictly positive density in $K$.
arXiv Detail & Related papers (2025-04-19T18:08:18Z) - Exploring Facets of Language Generation in the Limit [10.18252143035175]
We show that every countable language collection has a generator which has the stronger property of non-uniform generation in the limit.
We formalize the tension between validity and breadth in the generation algorithm of [KM24] by introducing a definition of exhaustive generation.
Our result shows that a tradeoff between validity and breadth is inherent for generation in the limit.
arXiv Detail & Related papers (2024-11-22T22:13:40Z) - On the Limits of Language Generation: Trade-Offs Between Hallucination and Mode Collapse [16.30681257128492]
Given samples from an unknown language, a language model should produce valid strings not seen in training.
Outputting invalid strings constitutes "hallucination," while failing to capture the full range of the language leads to "mode collapse."
We investigate this within a statistical language generation setting building on Gold and Angluin.
arXiv Detail & Related papers (2024-11-14T18:06:55Z) - Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z) - Unlocking Anticipatory Text Generation: A Constrained Approach for Large Language Models Decoding [75.06872859716049]
Large Language Models (LLMs) have demonstrated a powerful ability for text generation.
However, undesired behaviors such as toxicity or hallucinations can manifest.
We propose formalizing text generation as a future-constrained generation problem.
arXiv Detail & Related papers (2023-12-11T06:35:33Z) - Lexinvariant Language Models [84.2829117441298]
Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM).
We study lexinvariant language models that are invariant to lexical symbols and therefore do not need fixed token embeddings in practice.
We show that a lexinvariant LM can attain perplexity comparable to that of a standard language model, given a sufficiently long context.
arXiv Detail & Related papers (2023-05-24T19:10:46Z) - Combinatorial NLTS From the Overlap Gap Property [2.594420805049218]
Anshu, Breuckmann, and Nirkhe [ABN22] positively resolved the so-called No Low-Energy Trivial State (NLTS) conjecture due to Freedman and Hastings.
The conjecture postulated the existence of linear-size local Hamiltonians on n qubit systems for which no near-ground state can be prepared by a shallow (sublogarithmic depth) circuit.
arXiv Detail & Related papers (2023-04-02T22:16:26Z) - The Stable Entropy Hypothesis and Entropy-Aware Decoding: An Analysis and Algorithm for Robust Natural Language Generation [59.7381286976957]
We show that "human-like" generations usually lie in a narrow and nearly flat entropy band.
We propose an entropy-aware decoding algorithm that respects these entropy bounds.
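As a rough, hedged sketch of the general idea - not the paper's exact algorithm - the snippet below keeps the entropy of a toy next-token distribution inside a fixed band by adjusting the softmax temperature before sampling. The logits, the band limits, and the function names are assumptions made for this illustration.

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy (in nats) of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def softmax(logits: np.ndarray, temperature: float) -> np.ndarray:
    z = logits / temperature
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy_banded_sample(logits, low, high, rng, steps=20):
    """Nudge the temperature until the next-token entropy falls inside
    [low, high], then sample - a toy stand-in for an entropy-aware
    decoding rule."""
    t = 1.0
    for _ in range(steps):
        p = softmax(logits, t)
        h = entropy(p)
        if h < low:
            t *= 1.1                  # too peaked -> flatten
        elif h > high:
            t /= 1.1                  # too flat -> sharpen
        else:
            break
    p = softmax(logits, t)
    return int(rng.choice(len(p), p=p)), entropy(p)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = np.array([4.0, 2.0, 1.0, 0.5, 0.1])   # toy next-token scores
    token, h = entropy_banded_sample(logits, low=0.8, high=1.4, rng=rng)
    print(f"sampled token {token} at entropy {h:.2f} nats")
```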
arXiv Detail & Related papers (2023-02-14T02:02:33Z) - A Measure-Theoretic Characterization of Tight Language Models [105.16477132329416]
In some pathological cases, probability mass can "leak" onto the set of infinite sequences.
This paper offers a measure-theoretic treatment of language modeling.
We prove that many popular language model families are in fact tight, meaning that they will not leak in this sense.
arXiv Detail & Related papers (2022-12-20T18:17:11Z) - Generating Sequences by Learning to Self-Correct [64.0249217590888]
Self-Correction decouples an imperfect base generator from a separate corrector that learns to iteratively correct imperfect generations.
We show that Self-Correction improves upon the base generator in three diverse generation tasks.
arXiv Detail & Related papers (2022-10-31T18:09:51Z) - Typical Decoding for Natural Language Generation [76.69397802617064]
We study why high-probability texts can be dull or repetitive.
We show that typical sampling offers competitive performance in terms of quality.
arXiv Detail & Related papers (2022-02-01T18:58:45Z) - Hidden Cosets and Applications to Unclonable Cryptography [15.248351992500078]
We study a generalization of hidden subspace states to hidden coset states (first introduced by Aaronson and Christiano [STOC '12]).
We explore unclonable properties of coset states and several applications.
arXiv Detail & Related papers (2021-07-12T19:04:01Z) - Language learnability in the limit for general metrics: a Gold-Angluin result [91.3755431537592]
We use Niyogi's extended version of a theorem by Blum and Blum (1975) on the existence of locking data sets to prove a necessary condition for learnability in the limit of any family of languages in any given metric.
When the language family is further assumed to contain all finite languages, the same condition also becomes sufficient for learnability in the limit.
arXiv Detail & Related papers (2021-03-24T13:11:09Z) - Conditional Hybrid GAN for Sequence Generation [56.67961004064029]
We propose a novel conditional hybrid GAN (C-Hybrid-GAN) for generating discrete-valued sequences conditioned on context.
We exploit the Gumbel-Softmax technique to approximate the distribution of discrete-valued sequences.
We demonstrate that the proposed C-Hybrid-GAN outperforms the existing methods in context-conditioned discrete-valued sequence generation.
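As a rough illustration of the Gumbel-Softmax relaxation mentioned above - a sketch only, not the C-Hybrid-GAN architecture - the snippet below draws a differentiable, approximately one-hot sample from a toy categorical distribution; the probabilities and temperatures are assumptions made for this example.

```python
import numpy as np

def gumbel_softmax_sample(logits: np.ndarray, temperature: float, rng) -> np.ndarray:
    """Draw one Gumbel-Softmax sample: a relaxed (soft) one-hot vector whose
    argmax is distributed like softmax(logits).  Lower temperatures give
    vectors closer to exact one-hot samples."""
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))             # Gumbel(0, 1) noise
    z = (logits + gumbel_noise) / temperature
    z -= z.max()                                   # numerical stability
    e = np.exp(z)
    return e / e.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = np.log(np.array([0.7, 0.2, 0.1]))     # toy categorical distribution
    for tau in (1.0, 0.1):
        sample = gumbel_softmax_sample(logits, tau, rng)
        print(f"tau={tau}: {np.round(sample, 3)}")
```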
arXiv Detail & Related papers (2020-09-18T03:52:55Z) - Discrete Variational Attention Models for Language Generation [51.88612022940496]
We propose a discrete variational attention model with a categorical distribution over the attention mechanism, owing to the discrete nature of language.
Thanks to the property of discreteness, the training of our proposed approach does not suffer from posterior collapse.
arXiv Detail & Related papers (2020-04-21T05:49:04Z)