Quantifying Noise in Language Generation
- URL: http://arxiv.org/abs/2601.21237v1
- Date: Thu, 29 Jan 2026 03:58:40 GMT
- Title: Quantifying Noise in Language Generation
- Authors: Aaron Li, Ian Zhang
- Abstract summary: We show that for both uniform and non-uniform generation, a single noisy string strictly reduces the set of collections that can be generated. We provide the first known characterization for non-uniform noise-dependent generatability.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Kleinberg and Mullainathan recently proposed a formal framework for studying the phenomenon of language generation, called language generation in the limit. In this model, an adversary gives an enumeration of example strings from an unknown target language, and the algorithm is tasked with correctly generating unseen strings from the target language within finite time. Refined notions of non-uniform and uniform generation were later introduced by Li, Raman, and Tewari (2025), and a noisy model was introduced by Raman and Raman (2025), which allows the adversary to insert extraneous strings. A natural question in the noisy model is to quantify the effect of noise, by studying the impact of each additional extraneous string. We show two complementary results in this setting. We first show that for both uniform and non-uniform generation, a single noisy string strictly reduces the set of collections that can be generated, thus answering an open question in Raman and Raman (2025). Then, we show for both uniform and non-uniform generation that generation with a single noisy string is equivalent to generation with any finite amount of noise, sharply contrasting with the strict hierarchy for noisy generation in the limit shown by Bai, Panigrahi, and Zhang (2026). Finally, we leverage our previous results to provide the first known characterization for non-uniform noise-dependent generatability.
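To make the setup concrete, below is a toy Python sketch of the generation-in-the-limit game described in the abstract. The collection of languages, the membership test, and the simple extend-the-longest strategy are my own illustration, not the paper's construction; the noisy variant is noted only in a comment.

```python
# Toy sketch of language generation in the limit (illustrative names and
# strategy; not the paper's construction). The adversary enumerates the
# target language K; after each example the generator must emit a string
# it has not yet seen, and it succeeds if all but finitely many outputs
# lie in K. In the noisy model of Raman and Raman (2025), the enumeration
# may additionally contain extraneous strings outside K.

def generator(seen):
    """Simple strategy for the unary collection {a^k : k >= 1} and its
    finite prefixes: extend the longest example observed so far."""
    longest = max(seen, key=len)
    return longest + "a"

in_target = lambda s: len(s) >= 1 and set(s) == {"a"}  # K = {a, aa, aaa, ...}

seen, trace = set(), []
for k in range(1, 8):                 # adversary enumerates K in order
    seen.add("a" * k)
    guess = generator(seen)
    trace.append((guess, guess not in seen and in_target(guess)))

print(trace)  # every guess is unseen and in K, so generation succeeds here
```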
Related papers
- Language Generation with Infinite Contamination [17.31852533022177]
We study language generation in the limit, where an algorithm observes an adversarial enumeration of strings from an unknown target language $K$. We show that generation with density, surprisingly, remains achievable at the same generality. This suggests curriculum learning may be crucial for learning from noisy web data.
arXiv Detail & Related papers (2025-11-10T18:59:39Z)
- Language Generation in the Limit: Noise, Loss, and Feedback [10.280148603465697]
We show that a finite union of uniformly generatable collections is generatable in the limit, and ask if the same is true for non-uniform generation. We show the equivalence of these models for uniform and non-uniform generation, and provide a characterization of non-uniform noisy generation.
arXiv Detail & Related papers (2025-07-21T07:18:04Z)
- On Union-Closedness of Language Generation [48.36356615217017]
We investigate language generation in the limit, a model proposed by Kleinberg and Mullainathan and extended by Li, Raman, and Tewari. Our results resolve two open questions of Li et al. by proving that finite unions of generatable or non-uniformly generatable classes need not be generatable. Our approach utilizes carefully constructed classes along with a novel diagonalization argument that could be of independent interest in the growing area of language generation.
arXiv Detail & Related papers (2025-06-23T13:42:25Z)
- On Characterizations for Language Generation: Interplay of Hallucinations, Breadth, and Stability [16.30681257128492]
[KM24] gives an algorithm for generating from any countable language collection in the limit. Recent work introduces different notions of breadth and explores when generation with breadth is possible. Our results show that generation with many existing notions of breadth becomes equally hard when stability is required.
arXiv Detail & Related papers (2024-12-24T16:24:43Z)
- Exploring Facets of Language Generation in the Limit [10.18252143035175]
We show that every countable language collection has a generator which has the stronger property of non-uniform generation in the limit. We formalize the tension between validity and breadth in the generation algorithm of [KM24] by introducing a definition of exhaustive generation. We also provide a precise characterization of the language collections for which exhaustive generation is possible.
arXiv Detail & Related papers (2024-11-22T22:13:40Z)
- Gumbel Counterfactual Generation From Language Models [64.55296662926919]
We show that counterfactual reasoning is conceptually distinct from interventions. We propose a framework for generating true string counterfactuals. We show that the approach produces meaningful counterfactuals while at the same time showing that commonly used intervention techniques have considerable undesired side effects.
arXiv Detail & Related papers (2024-11-11T17:57:30Z)
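The Gumbel-max idea underlying this kind of framework can be illustrated in a few lines. The sketch below is my own minimal numpy illustration, not the authors' implementation: fix the exogenous Gumbel noise that produced a sampled string, intervene on the model's logits, and replay the same noise to obtain the counterfactual string. The toy "models" and vocabulary size are assumptions for illustration.

```python
# Minimal numpy sketch (illustrative, not the authors' code) of Gumbel-max
# counterfactual generation: reuse fixed Gumbel noise under an intervened model.
import numpy as np

rng = np.random.default_rng(0)
vocab, length = 5, 4                        # toy vocabulary size and length

def decode(logits_fn, noise):
    """Gumbel-max decoding with fixed exogenous noise: token_t is
    argmax(logits + g_t), so reusing `noise` fixes all randomness."""
    tokens = []
    for t in range(length):
        tokens.append(int(np.argmax(logits_fn(tokens) + noise[t])))
    return tokens

noise = rng.gumbel(size=(length, vocab))    # shared exogenous randomness

base_model = lambda prefix: np.zeros(vocab)            # toy uniform "model"
intervened = lambda prefix: 3.0 * np.eye(vocab)[2]     # intervention: boost token 2

factual = decode(base_model, noise)
counterfactual = decode(intervened, noise)  # same noise, intervened model
print(factual, counterfactual)
```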
- The Stable Entropy Hypothesis and Entropy-Aware Decoding: An Analysis and Algorithm for Robust Natural Language Generation [59.7381286976957]
We show that "human-like" generations usually lie in a narrow and nearly flat entropy band.
We propose an entropy-aware decoding algorithm that respects these entropy bounds.
arXiv Detail & Related papers (2023-02-14T02:02:33Z)
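A hedged sketch of the entropy-aware idea follows; it is my paraphrase of the mechanism, not the authors' implementation. Greedy decoding stands in here for the paper's intervention, and the band endpoints `lo` and `hi` are assumed toy values rather than estimates from human text.

```python
# Sketch of entropy-aware decoding (illustrative): sample freely while the
# next-token entropy stays inside a target band, and intervene otherwise.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

def entropy_aware_step(logits, lo, hi):
    """In-band: ancestral sampling; out-of-band: a simple greedy
    intervention to pull generation back toward the stable band."""
    p = softmax(logits)
    if lo <= entropy(p) <= hi:
        return int(rng.choice(len(p), p=p))
    return int(np.argmax(p))

print(entropy_aware_step(rng.normal(size=50), lo=1.0, hi=3.0))
```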
- SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers [50.90457644954857]
In this work, we apply diffusion models to approach sequence-to-sequence text generation.
We propose SeqDiffuSeq, a text diffusion model for sequence-to-sequence generation.
Experimental results demonstrate good performance on sequence-to-sequence generation in terms of text quality and inference time.
arXiv Detail & Related papers (2022-12-20T15:16:24Z)
- Locally Typical Sampling [84.62530743899025]
We show that today's probabilistic language generators fall short when it comes to producing coherent and fluent text. We propose a simple and efficient procedure for enforcing this criterion when generating from probabilistic models.
arXiv Detail & Related papers (2022-02-01T18:58:45Z)
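Below is a numpy sketch of locally typical sampling as I understand the procedure (not the authors' code): keep the tokens whose surprisal -log p(x) lies closest to the entropy of the next-token distribution, take the smallest such set with probability mass at least tau, renormalize, and sample. The default tau=0.95 is an assumed illustrative value.

```python
# Sketch of locally typical sampling (illustrative implementation).
import numpy as np

rng = np.random.default_rng(0)

def typical_sample(logits, tau=0.95):
    """Sample from the smallest set of tokens, ordered by closeness of
    surprisal to the distribution's entropy, with mass >= tau."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    surprisal = -np.log(p + 1e-12)
    H = float((p * surprisal).sum())               # conditional entropy
    order = np.argsort(np.abs(surprisal - H))      # closest-to-entropy first
    cutoff = int(np.searchsorted(np.cumsum(p[order]), tau)) + 1
    keep = order[:cutoff]
    q = np.zeros_like(p)
    q[keep] = p[keep] / p[keep].sum()              # renormalized typical set
    return int(rng.choice(len(p), p=q))

print(typical_sample(rng.normal(size=100)))
```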
- Unconditional Audio Generation with Generative Adversarial Networks and Cycle Regularization [48.55126268721948]
We present a generative adversarial network (GAN)-based model for unconditional generation of the mel-spectrograms of singing voices.
We employ a hierarchical architecture in the generator to induce some structure in the temporal dimension.
We evaluate the performance of the new model not only for generating singing voices, but also for generating speech.
arXiv Detail & Related papers (2020-05-18T08:35:16Z)