Language Generation with Infinite Contamination
- URL: http://arxiv.org/abs/2511.07417v1
- Date: Mon, 10 Nov 2025 18:59:39 GMT
- Title: Language Generation with Infinite Contamination
- Authors: Anay Mehrotra, Grigoris Velegkas, Xifan Yu, Felix Zhou
- Abstract summary: We study language generation in the limit, where an algorithm observes an adversarial enumeration of strings from an unknown target language $K$. We show that generation with density, surprisingly, remains achievable at the same generality. This suggests curriculum learning may be crucial for learning from noisy web data.
- Score: 17.31852533022177
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study language generation in the limit, where an algorithm observes an adversarial enumeration of strings from an unknown target language $K$ and must eventually generate new, unseen strings from $K$. Kleinberg and Mullainathan [KM24] proved that generation is achievable in surprisingly general settings. But their generator suffers from "mode collapse," producing from an ever-smaller subset of the target. To address this, Kleinberg and Wei [KW25] require the generator's output to be "dense" in the target language. They showed that generation with density, surprisingly, remains achievable at the same generality. Both results assume perfect data: no noisy insertions and no omissions. This raises a central question: how much contamination can generation tolerate? Recent works made partial progress on this question by studying (non-dense) generation with either finite amounts of noise (but no omissions) or omissions (but no noise). We characterize robustness under contaminated enumerations:
1. Generation under Contamination: Language generation in the limit is achievable for all countable collections iff the fraction of contaminated examples converges to zero. When this fails, we characterize which collections are generable.
2. Dense Generation under Contamination: Dense generation is strictly less robust to contamination than generation.
As a byproduct, we resolve an open question of Raman and Raman [ICML25] by showing that generation is possible with only membership oracle access under finitely many contaminated examples. Finally, we introduce a beyond-worst-case model inspired by curriculum learning and prove that dense generation is achievable even with infinite contamination provided the fraction of contaminated examples converges to zero. This suggests curriculum learning may be crucial for learning from noisy web data.
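A rough formalization of the setting may help parse these results; the notation here ($x_t$ for the adversary's $t$-th string, $G_t$ for the algorithm's output after $t$ examples, $c_t$ for the number of contaminated examples among the first $t$) is ours and may differ from the paper's. The adversary enumerates strings $x_1, x_2, \dots$, and generation in the limit requires that
$$\exists t_0 \ \forall t \ge t_0: \quad G_t \in K \setminus \{x_1, \dots, x_t\},$$
i.e., from some finite time onward every output is a new, unseen string of the target $K$. With $c_t$ counting contaminated examples (the paper makes precise how noisy insertions and omissions are counted), result 1 above says that generation in the limit is achievable for every countable collection iff the enumeration satisfies
$$\lim_{t \to \infty} \frac{c_t}{t} = 0.$$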
Related papers
- Quantifying Noise in Language Generation [0.3795745240553126]
We show that for both uniform and non-uniform generation, a single noisy string strictly reduces the set of collections that can be generated. We provide the first known characterization for non-uniform noise-dependent generatability.
arXiv Detail & Related papers (2026-01-29T03:58:40Z)
- Language Generation: Complexity Barriers and Implications for Learning [51.449718747429756]
We show that even for simple and well-studied language families, the number of examples required for successful generation can be extraordinarily large. These results reveal a substantial gap between theoretical possibility and efficient learnability.
arXiv Detail & Related papers (2025-11-07T23:06:48Z)
- Language Generation in the Limit: Noise, Loss, and Feedback [10.280148603465697]
We show that a finite union of uniformly generatable collections is generatable in the limit, and ask whether the same is true for non-uniform generation. We show the equivalence of these models for uniform and non-uniform generation, and provide a characterization of non-uniform noisy generation.
arXiv Detail & Related papers (2025-07-21T07:18:04Z)
- On Union-Closedness of Language Generation [48.36356615217017]
We investigate language generation in the limit - a model proposed by Kleinberg and Mullainathan and extended by Li, Raman, and Tewari. Our results resolve two open questions of Li et al. by proving that finite unions of generatable or non-uniformly generatable classes need not be generatable. Our approach utilizes carefully constructed classes along with a novel diagonalization argument that could be of independent interest in the growing area of language generation.
arXiv Detail & Related papers (2025-06-23T13:42:25Z)
- On Characterizations for Language Generation: Interplay of Hallucinations, Breadth, and Stability [16.30681257128492]
[KM24] gave an algorithm for generating from any countable language collection in the limit. Recent work introduces different notions of breadth and explores when generation with breadth is possible. Our results show that generation with many existing notions of breadth becomes equally hard when stability is required.
arXiv Detail & Related papers (2024-12-24T16:24:43Z)
- Exploring Facets of Language Generation in the Limit [10.18252143035175]
We show that every countable language collection has a generator which has the stronger property of non-uniform generation in the limit. We formalize the tension between validity and breadth in the generation algorithm of [KM24] by introducing a definition of exhaustive generation. We also provide a precise characterization of the language collections for which exhaustive generation is possible.
arXiv Detail & Related papers (2024-11-22T22:13:40Z)
- Unlocking Anticipatory Text Generation: A Constrained Approach for Large Language Models Decoding [75.06872859716049]
Large Language Models (LLMs) have demonstrated a powerful ability for text generation. However, undesired behaviors such as toxicity or hallucinations can manifest. We propose formalizing text generation as a future-constrained generation problem.
arXiv Detail & Related papers (2023-12-11T06:35:33Z)
- The Stable Entropy Hypothesis and Entropy-Aware Decoding: An Analysis and Algorithm for Robust Natural Language Generation [59.7381286976957]
We show that "human-like" generations usually lie in a narrow and nearly flat entropy band.
We propose an entropy-aware decoding algorithm that respects these entropy bounds.
arXiv Detail & Related papers (2023-02-14T02:02:33Z)
- Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation [65.48908724440047]
We propose a method called reverse generation to construct adversarial contexts conditioned on a given response.
We test three popular pretrained dialogue models (Blender, DialoGPT, and Plato2) and find that BAD+ can largely expose their safety problems.
arXiv Detail & Related papers (2022-12-04T12:23:41Z)
- Generating Sequences by Learning to Self-Correct [64.0249217590888]
Self-Correction decouples an imperfect base generator from a separate corrector that learns to iteratively correct imperfect generations.
We show that Self-Correction improves upon the base generator in three diverse generation tasks.
arXiv Detail & Related papers (2022-10-31T18:09:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.