Language Generation in the Limit
- URL: http://arxiv.org/abs/2404.06757v1
- Date: Wed, 10 Apr 2024 05:53:25 GMT
- Title: Language Generation in the Limit
- Authors: Jon Kleinberg, Sendhil Mullainathan,
- Abstract summary: We show that there is an agent that is able to generate in the limit for every countable list of candidate languages.
This contrasts dramatically with negative results due to Gold and Angluin in a well-studied model of language learning.
- Score: 0.7787343335258782
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although current large language models are complex, the most basic specifications of the underlying language generation problem itself are simple to state: given a finite set of training samples from an unknown language, produce valid new strings from the language that don't already appear in the training data. Here we ask what we can conclude about language generation using only this specification, without further assumptions. In particular, suppose that an adversary enumerates the strings of an unknown target language L that is known only to come from one of a possibly infinite list of candidates. A computational agent is trying to learn to generate from this language; we say that the agent generates from L in the limit if after some finite point in the enumeration of L, the agent is able to produce new elements that come exclusively from L and that have not yet been presented by the adversary. Our main result is that there is an agent that is able to generate in the limit for every countable list of candidate languages. This contrasts dramatically with negative results due to Gold and Angluin in a well-studied model of language learning where the goal is to identify an unknown language from samples; the difference between these results suggests that identifying a language is a fundamentally different problem than generating from it.
Related papers
- Exploring Facets of Language Generation in the Limit [10.18252143035175]
We show that every countable language collection has a generator which has the stronger property of non-uniform generation in the limit.
We formalize the tension between validity and breadth in the generation algorithm of [KM24] by introducing a definition of exhaustive generation.
Our result shows that a tradeoff between validity and breadth is inherent for generation in the limit.
arXiv Detail & Related papers (2024-11-22T22:13:40Z) - On the Limits of Language Generation: Trade-Offs Between Hallucination and Mode Collapse [16.30681257128492]
Given samples from an unknown language, a language model should produce valid strings not seen in training.
Otherwise, outputting invalid strings constitutes "hallucination," and failing to capture the full range leads to "mode collapse"
We investigate this within a statistical language generation setting building on Gold and Angluin.
arXiv Detail & Related papers (2024-11-14T18:06:55Z) - Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z) - modeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models [23.105555180223487]
modeLing is a novel benchmark of Linguistics Olympiad-style puzzles which tests few-shot reasoning in AI systems.
We evaluate several large open source language models and GPT on our benchmark.
arXiv Detail & Related papers (2024-06-24T18:00:59Z) - Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - Detecting Languages Unintelligible to Multilingual Models through Local
Structure Probes [15.870989191524094]
We develop a general approach that requires only unlabelled text to detect which languages are not well understood by a cross-lingual model.
Our approach is derived from the hypothesis that if a model's understanding is insensitive to perturbations to text in a language, it is likely to have a limited understanding of that language.
arXiv Detail & Related papers (2022-11-09T16:45:16Z) - Lifting the Curse of Multilinguality by Pre-training Modular
Transformers [72.46919537293068]
multilingual pre-trained models suffer from the curse of multilinguality, which causes per-language performance to drop as they cover more languages.
We introduce language-specific modules, which allows us to grow the total capacity of the model, while keeping the total number of trainable parameters per language constant.
Our approach enables adding languages post-hoc with no measurable drop in performance, no longer limiting the model usage to the set of pre-trained languages.
arXiv Detail & Related papers (2022-05-12T17:59:56Z) - Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z) - Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
arXiv Detail & Related papers (2020-08-20T17:58:56Z) - Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.