Characterizations of Language Generation With Breadth
- URL: http://arxiv.org/abs/2412.18530v1
- Date: Tue, 24 Dec 2024 16:24:43 GMT
- Title: Characterizations of Language Generation With Breadth
- Authors: Alkis Kalavasis, Anay Mehrotra, Grigoris Velegkas
- Abstract summary: We study language generation in the limit, introduced by Kleinberg and Mullainathan [KM24].
[KM24] proposed an algorithm that generates strings from any countable language collection in the limit.
We show that generation with exact breadth is characterized by Angluin's condition for identification.
- Score: 16.30681257128492
- License:
- Abstract: We study language generation in the limit, introduced by Kleinberg and Mullainathan [KM24], building on classical works of Gold [Gol67] and Angluin [Ang79]. [KM24] proposed an algorithm that generates strings from any countable language collection in the limit. While their algorithm eventually outputs strings from the target language $K$, it sacrifices breadth, i.e., the ability to generate all strings in $K$. A key open question in [KM24] is whether this trade-off between consistency and breadth is inherent. Recent works proposed different notions of consistent generation with breadth. Kalavasis, Mehrotra, and Velegkas [KVM24] introduced three definitions: generation with exact breadth, approximate breadth, and unambiguous generation. Concurrently and independently, Charikar and Pabbaraju [CP24a] proposed exhaustive generation. Both works examined when generation with these notions of breadth is possible. Building on [CP24a, KVM24], we fully characterize language generation for these notions and their natural combinations. For exact breadth, we provide an unconditional lower bound, removing a technical condition from [KVM24] and extending the result of [CP24a] that holds for specific collections of languages. We show that generation with exact breadth is characterized by Angluin's condition for identification. We further introduce a weaker version of Angluin's condition that tightly characterizes both approximate breadth and exhaustive generation, proving their equivalence. Additionally, we show that unambiguous generation is also characterized by Angluin's condition as a special case of a broader result. Finally, we strengthen [KVM24] by giving unconditional lower bounds for stable generators, showing that Angluin's condition characterizes the previous breadth notions for stable generators. This shows a separation between stable and unstable generation with approximate breadth.
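For reference, the "Angluin's condition" invoked throughout the abstract is the classical tell-tale criterion for identification in the limit. Below is a sketch of its standard (non-effective) statement; the exact formulation used in the paper may differ in presentation.

```latex
% Angluin's tell-tale condition (sketch of the classical formulation, Angluin 1980):
% a countable collection of languages is identifiable in the limit from positive
% examples iff every language in the collection admits a finite "tell-tale" subset.
\[
\forall L \in \mathcal{L} \;\; \exists\, T \subseteq L \text{ with } |T| < \infty
\;\text{ such that }\;
\forall L' \in \mathcal{L}:\; T \subseteq L' \implies L' \not\subsetneq L .
\]
```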
Related papers
- Exploring Facets of Language Generation in the Limit [10.18252143035175]
We show that every countable language collection has a generator which has the stronger property of non-uniform generation in the limit.
We formalize the tension between validity and breadth in the generation algorithm of [KM24] by introducing a definition of exhaustive generation.
We also provide a precise characterization of the language collections for which exhaustive generation is possible.
arXiv Detail & Related papers (2024-11-22T22:13:40Z) - On the Limits of Language Generation: Trade-Offs Between Hallucination and Mode Collapse [16.30681257128492]
Given samples from an unknown language, a language model should produce valid strings not seen in training.
Outputting invalid strings constitutes "hallucination," while failing to capture the full range of the language leads to "mode collapse."
We investigate this within a statistical language generation setting building on Gold and Angluin.
arXiv Detail & Related papers (2024-11-14T18:06:55Z) - Return of Unconditional Generation: A Self-supervised Representation Generation Method [36.27605000082541]
Unconditional generation is a problem of modeling data distribution without relying on human-annotated labels.
In this work, we show that one can close this gap by generating semantic representations in the representation space produced by a self-supervised encoder.
This framework, called Representation-Conditioned Generation (RCG), provides an effective solution to the unconditional generation problem without using labels.
arXiv Detail & Related papers (2023-12-06T18:59:31Z) - Lexinvariant Language Models [84.2829117441298]
Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM).
We study lexinvariant language models that are invariant to lexical symbols and therefore do not need fixed token embeddings in practice.
We show that a lexinvariant LM can attain perplexity comparable to that of a standard language model, given a sufficiently long context.
arXiv Detail & Related papers (2023-05-24T19:10:46Z) - Combinatorial NLTS From the Overlap Gap Property [2.594420805049218]
Anshu, Breuckmann, and Nirkhe [ABN22] resolved positively the so-called No Low-Energy Trivial State conjecture by Freedman and Hastings.
The conjecture postulated the existence of linear-size local Hamiltonians on n qubit systems for which no near-ground state can be prepared by a shallow (sublogarithmic depth) circuit.
arXiv Detail & Related papers (2023-04-02T22:16:26Z) - A Measure-Theoretic Characterization of Tight Language Models [105.16477132329416]
In some pathological cases, probability mass can "leak" onto the set of infinite sequences.
This paper offers a measure-theoretic treatment of language modeling.
We prove that many popular language model families are in fact tight, meaning that they will not leak in this sense.
arXiv Detail & Related papers (2022-12-20T18:17:11Z) - Typical Decoding for Natural Language Generation [76.69397802617064]
We study why high-probability texts can be dull or repetitive.
We show that typical sampling offers competitive performance in terms of quality.
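As a concrete illustration of the decoding rule studied in this entry: locally typical sampling restricts sampling to tokens whose surprisal is close to the conditional entropy of the model's next-token distribution. The NumPy sketch below captures that idea; the function name typical_filter and the mass threshold tau are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def typical_filter(logits, tau=0.95):
    """Keep the tokens whose surprisal is closest to the entropy of the
    next-token distribution, accumulating mass up to tau (typical sampling)."""
    logp = logits - np.logaddexp.reduce(logits)  # log-softmax
    p = np.exp(logp)
    entropy = -(p * logp).sum()                  # H(p) = -sum_x p(x) log p(x)
    scores = np.abs(-logp - entropy)             # |surprisal - entropy|
    order = np.argsort(scores)                   # most "typical" tokens first
    keep = order[np.cumsum(p[order]) <= tau]
    keep = order[: max(len(keep), 1)]            # always keep at least one token
    filtered = np.zeros_like(p)
    filtered[keep] = p[keep]
    return filtered / filtered.sum()             # renormalized distribution

# Usage: sample the next token from the filtered distribution.
rng = np.random.default_rng(0)
logits = rng.normal(size=50)
next_token = rng.choice(len(logits), p=typical_filter(logits))
```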
arXiv Detail & Related papers (2022-02-01T18:58:45Z) - Hidden Cosets and Applications to Unclonable Cryptography [15.248351992500078]
We study a generalization of hidden subspace states to hidden coset states (first introduced by Aaronson and Christiano [STOC '12]).
We explore unclonable properties of coset states and several applications.
arXiv Detail & Related papers (2021-07-12T19:04:01Z) - Language learnability in the limit for general metrics: a Gold-Angluin result [91.3755431537592]
We use Niyogi's extended version of a theorem by Blum and Blum (1975) on the existence of locking data sets to prove a necessary condition for learnability in the limit of any family of languages in any given metric.
When the language family is further assumed to contain all finite languages, the same condition also becomes sufficient for learnability in the limit.
arXiv Detail & Related papers (2021-03-24T13:11:09Z) - Conditional Hybrid GAN for Sequence Generation [56.67961004064029]
We propose a novel conditional hybrid GAN (C-Hybrid-GAN) for generating discrete-valued sequences.
We exploit the Gumbel-Softmax technique to approximate the distribution of discrete-valued sequences.
We demonstrate that the proposed C-Hybrid-GAN outperforms the existing methods in context-conditioned discrete-valued sequence generation.
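The Gumbel-Softmax technique mentioned above replaces non-differentiable categorical sampling with a temperature-controlled soft relaxation, which is what lets a generator emit discrete-valued sequences while remaining trainable by backpropagation. A minimal NumPy sketch of that standard relaxation follows; gumbel_softmax_sample and its defaults are illustrative, not the paper's code.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature=1.0, rng=None):
    """Draw a 'soft' one-hot sample from a categorical distribution using the
    Gumbel-Softmax relaxation; as temperature -> 0 the sample approaches one-hot."""
    rng = rng or np.random.default_rng()
    # Gumbel(0, 1) noise via the inverse-CDF trick.
    gumbel = -np.log(-np.log(rng.uniform(1e-20, 1.0, size=logits.shape)))
    y = (logits + gumbel) / temperature
    y = y - y.max()                  # numerical stability before exponentiation
    expy = np.exp(y)
    return expy / expy.sum()         # relaxed (differentiable) one-hot vector

# Usage: relaxed sampling of one token from a 10-way categorical distribution.
rng = np.random.default_rng(0)
soft_token = gumbel_softmax_sample(rng.normal(size=10), temperature=0.5, rng=rng)
```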
arXiv Detail & Related papers (2020-09-18T03:52:55Z) - Discrete Variational Attention Models for Language Generation [51.88612022940496]
We propose a discrete variational attention model with a categorical distribution over the attention mechanism, owing to the discrete nature of language.
Thanks to the property of discreteness, the training of our proposed approach does not suffer from posterior collapse.
arXiv Detail & Related papers (2020-04-21T05:49:04Z)