Exploring Facets of Language Generation in the Limit
- URL: http://arxiv.org/abs/2411.15364v2
- Date: Tue, 24 Dec 2024 10:57:49 GMT
- Title: Exploring Facets of Language Generation in the Limit
- Authors: Moses Charikar, Chirag Pabbaraju
- Abstract summary: We show that every countable language collection has a generator which has the stronger property of non-uniform generation in the limit.
We formalize the tension between validity and breadth in the generation algorithm of [KM24] by introducing a definition of exhaustive generation.
We also provide a precise characterization of the language collections for which exhaustive generation is possible.
- Score: 10.18252143035175
- Abstract: The recent work of Kleinberg & Mullainathan [KM24] provides a concrete model for language generation in the limit: given a sequence of examples from an unknown target language, the goal is to generate new examples from the target language such that no incorrect examples are generated beyond some point. In sharp contrast to strong negative results for the closely related problem of language identification, they establish positive results for language generation in the limit for all countable collections of languages. Follow-up work by Raman & Tewari [RT24] studies bounds on the number of distinct inputs required by an algorithm before correct language generation is achieved -- namely, whether this is a constant for all languages in the collection (uniform generation) or a language-dependent constant (non-uniform generation). We show that every countable language collection has a generator which has the stronger property of non-uniform generation in the limit. However, while the generation algorithm of [KM24] can be implemented using membership queries, we show that no algorithm can non-uniformly generate using only membership queries, even for collections of just two languages. We also formalize the tension between validity and breadth in the generation algorithm of [KM24] by introducing a definition of exhaustive generation, and show a strong negative result for exhaustive generation. Our result shows that a tradeoff between validity and breadth is inherent for generation in the limit. We also provide a precise characterization of the language collections for which exhaustive generation is possible. Finally, inspired by algorithms that can choose to obtain feedback, we consider a model of uniform generation with feedback, completely characterizing language collections for which such uniform generation with feedback is possible in terms of a complexity measure of the collection.
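To make the setting in the abstract concrete, here is a minimal toy sketch, not the algorithm of [KM24] or of this paper. It assumes a hypothetical collection in which every language is the set of positive multiples of some integer k; the function names `generate` and `run` are illustrative only. Under that assumption, any multiple of the gcd of the examples seen so far is guaranteed to belong to the target language, so the generator is always valid.

```python
from functools import reduce
from math import gcd


def generate(seen):
    """Return an unseen element that must lie in the target language.

    Toy assumption: every candidate language is {k, 2k, 3k, ...} for some
    positive integer k, so every multiple of gcd(seen) is a multiple of k.
    """
    g = reduce(gcd, seen)
    m = 1
    while g * m in seen:  # skip strings already shown as examples
        m += 1
    return g * m


def run(target_k, examples):
    """Feed examples one at a time; record (output, is_valid) per step."""
    seen, history = set(), []
    for x in examples:
        seen.add(x)
        y = generate(seen)
        history.append((y, y % target_k == 0 and y not in seen))
    return history


# Target language: multiples of 3, presented in an arbitrary order.
history = run(3, [12, 18, 3, 6, 9])
```

In this run the generator never outputs an invalid string, but until the example 3 arrives it can only propose multiples of 12 or of 6. This is a small-scale picture of the validity versus breadth tension the abstract describes: staying safely inside the target language can mean covering only part of it.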
Related papers
- Characterizations of Language Generation With Breadth [16.30681257128492]
We study language generation in the limit, introduced by Kleinberg and Mullainathan [KM24].
KM24 proposed an algorithm that generates strings from any countable language collection in the limit.
We show that generation with exact breadth is characterized by Angluin's condition for identification.
arXiv Detail & Related papers (2024-12-24T16:24:43Z)
- On the Limits of Language Generation: Trade-Offs Between Hallucination and Mode Collapse [16.30681257128492]
Given samples from an unknown language, a language model should produce valid strings not seen in training.
Otherwise, outputting invalid strings constitutes "hallucination," and failing to capture the full range leads to "mode collapse."
We investigate this within a statistical language generation setting building on Gold and Angluin.
arXiv Detail & Related papers (2024-11-14T18:06:55Z)
- Language Generation in the Limit [0.7787343335258782]
We show that there is an agent that is able to generate in the limit for every countable list of candidate languages.
This contrasts dramatically with negative results due to Gold and Angluin in a well-studied model of language learning.
arXiv Detail & Related papers (2024-04-10T05:53:25Z)
- Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- Bridging the Gap Between Training and Inference of Bayesian Controllable Language Models [58.990214815032495]
Large-scale pre-trained language models have achieved great success on natural language generation tasks.
BCLMs have been shown to be efficient in controllable language generation.
We propose a "Gemini Discriminator" for controllable language generation which alleviates the mismatch problem with a small computational cost.
arXiv Detail & Related papers (2022-06-11T12:52:32Z)
- Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z)
- Typical Decoding for Natural Language Generation [76.69397802617064]
We study why high-probability texts can be dull or repetitive.
We show that typical sampling offers competitive performance in terms of quality.
arXiv Detail & Related papers (2022-02-01T18:58:45Z)
- Toward Cross-Lingual Definition Generation for Language Learners [10.45755551957024]
We propose to generate definitions in English for words in various languages.
Models can be applied directly to other languages after being trained on the English dataset.
Experiments and manual analyses show that our models have a strong cross-lingual transfer ability.
arXiv Detail & Related papers (2020-10-12T08:45:28Z)
- Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)