Language Generation and Identification From Partial Enumeration: Tight Density Bounds and Topological Characterizations
- URL: http://arxiv.org/abs/2511.05295v1
- Date: Fri, 07 Nov 2025 14:56:04 GMT
- Title: Language Generation and Identification From Partial Enumeration: Tight Density Bounds and Topological Characterizations
- Authors: Jon Kleinberg, Fan Wei
- Abstract summary: We study the framework of \emph{language generation in the limit}, where an adversary enumerates strings from an unknown language. We show that generation in the limit remains achievable, and if $C$ has lower density $\alpha$ in $K$, the algorithm's output achieves density at least $\alpha/2$. We characterize when identification in the limit is possible -- when hypotheses $M_t$ eventually satisfy $C \subseteq M \subseteq K$.
- Score: 1.6203023551115867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of large language models (LLMs) has motivated formal theories of language generation and learning. We study the framework of \emph{language generation in the limit}, where an adversary enumerates strings from an unknown language $K$ drawn from a countable class, and an algorithm must generate unseen strings from $K$. Prior work showed that generation is always possible, and that some algorithms achieve positive lower density, revealing a \emph{validity--breadth} trade-off between correctness and coverage. We resolve a main open question in this line, proving a tight bound of $1/2$ on the best achievable lower density. We then strengthen the model to allow \emph{partial enumeration}, where the adversary reveals only an infinite subset $C \subseteq K$. We show that generation in the limit remains achievable, and if $C$ has lower density $\alpha$ in $K$, the algorithm's output achieves density at least $\alpha/2$, matching the upper bound. This generalizes the $1/2$ bound to the partial-information setting, where the generator must recover within a factor $1/2$ of the revealed subset's density. We further revisit the classical Gold--Angluin model of \emph{language identification} under partial enumeration. We characterize when identification in the limit is possible -- when hypotheses $M_t$ eventually satisfy $C \subseteq M \subseteq K$ -- and in the process give a new topological formulation of Angluin's characterization, showing that her condition is precisely equivalent to an appropriate topological space having the $T_D$ separation property.
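As a concrete illustration of the generation-in-the-limit protocol (our sketch, not the paper's algorithm), consider the toy countable class $L_i = \{i, 2i, 3i, \dots\}$ of divisor languages over the positive integers: a generator can hypothesize the smallest consistent language via the gcd of the revealed strings and emit an unseen element of it.

```python
# Toy simulation of "generation in the limit" (our illustration; NOT the
# paper's algorithm). Countable class: L_i = {i, 2i, 3i, ...} over the
# positive integers, for i = 1, 2, 3, ...
from functools import reduce
from math import gcd

def generate_unseen(seen):
    """Hypothesize L_g with g = gcd(seen), the smallest language in this
    class containing every revealed string, and emit an unseen element."""
    g = reduce(gcd, seen)
    candidate = g
    while candidate in seen:
        candidate += g
    return candidate

# Adversary enumerates K = L_3 (multiples of 3) in an arbitrary order.
seen = set()
for s in [6, 12, 3, 9, 21, 15]:
    seen.add(s)
    out = generate_unseen(seen)
    print(f"seen {sorted(seen)} -> generate {out} (valid: {out % 3 == 0})")
```

Every output here is valid because the multiples of gcd(seen) are contained in the true language; this toy class illustrates only the protocol, not the density guarantees the paper establishes.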
Related papers
- Quantum Search With Generalized Wildcards [0.4310167974376404]
We show a quantum algorithm of cost $O(\sqrt{n} \log n)$ and a near-matching lower bound of $\Omega(\sqrt{n})$. We show near-tight bounds when $\mathcal{Q}$ is any of the following collections: bounded-size sets, contiguous blocks, prefixes, and only the full set.
arXiv Detail & Related papers (2025-11-06T18:55:05Z) - Pareto-optimal Non-uniform Language Generation [11.279808969568252]
We show that an algorithm whose generation time for some language $L$ is strictly smaller than $t^\star(L)$ must satisfy that its generation time for some other language $L'$ is strictly worse than $t^\star(L')$. Our framework conveniently adapts to give non-uniform generation algorithms in the practically motivated settings of noisy as well as representative generation.
arXiv Detail & Related papers (2025-10-03T08:08:20Z) - Density Measures for Language Generation [2.2872032473279065]
We study the trade-off between validity and breadth of language generation algorithms. Existing algorithms for language generation in the limit can produce output sets that have zero density in the true language $K$. We show, however, that there is an algorithm for language generation in the limit whose outputs have strictly positive density in $K$.
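To make the density notion concrete (our illustration; the formal definition uses the lower density of the output set inside $K$'s canonical enumeration), the snippet below estimates prefix densities: the even numbers keep density about $1/2$, while the powers of two have density tending to $0$.

```python
# Empirical lower-density check (illustrative). Take K = 1, 2, 3, ... in
# canonical order; the lower density of A in K is liminf_n |A ∩ K_n| / n.

def prefix_densities(is_member, n_max):
    """Return |A ∩ {1..n}| / n for n = 1..n_max."""
    count, out = 0, []
    for n in range(1, n_max + 1):
        count += is_member(n)
        out.append(count / n)
    return out

evens = prefix_densities(lambda x: x % 2 == 0, 10_000)
pows2 = prefix_densities(lambda x: (x & (x - 1)) == 0, 10_000)
print(min(evens[100:]))   # ~0.5    -> strictly positive lower density
print(min(pows2[100:]))   # ~0.0014 -> tending to 0: zero lower density
```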
arXiv Detail & Related papers (2025-04-19T18:08:18Z) - A simple and improved algorithm for noisy, convex, zeroth-order optimisation [59.51990161522328]
We construct an algorithm that returns a point $\hat{x} \in \bar{\mathcal{X}}$ such that $f(\hat{x})$ is as small as possible.
We prove that $f(\hat{x}) - \min_{x \in \bar{\mathcal{X}}} f(x)$ is of smaller order than $d^2/\sqrt{n}$, up to poly-logarithmic terms.
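A minimal sketch of the noisy zeroth-order setting (not this paper's method): estimate gradients by two-point finite differences, averaging repeated noisy evaluations, and run plain gradient descent on a toy convex quadratic.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_f(x):
    """Noisy zeroth-order oracle for a simple convex quadratic (illustrative)."""
    return float(np.dot(x, x)) + rng.normal(scale=0.1)

def zeroth_order_descent(d=5, n_queries=20_000, h=0.05, lr=0.1):
    """Two-point finite-difference gradient estimates, averaged over
    repeated queries per coordinate to suppress the evaluation noise."""
    x = np.ones(d)
    reps = 20                                   # averaged queries per direction
    steps = n_queries // (2 * reps * d)
    for _ in range(steps):
        g = np.zeros(d)
        for i in range(d):
            e = np.zeros(d); e[i] = h
            plus = np.mean([noisy_f(x + e) for _ in range(reps)])
            minus = np.mean([noisy_f(x - e) for _ in range(reps)])
            g[i] = (plus - minus) / (2 * h)
        x -= lr * g
    return x

x_hat = zeroth_order_descent()
print(np.dot(x_hat, x_hat))   # f(x_hat) - min f; should be near 0
```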
arXiv Detail & Related papers (2024-06-26T18:19:10Z) - Distribution-Independent Regression for Generalized Linear Models with
Oblivious Corruptions [49.69852011882769]
We show the first algorithms for the problem of regression for generalized linear models (GLMs) in the presence of additive oblivious noise.
We present an algorithm that tackles this problem in its most general distribution-independent setting.
This is the first algorithmic result for GLM regression with oblivious noise which can handle more than half the samples being arbitrarily corrupted.
arXiv Detail & Related papers (2023-09-20T21:41:59Z) - Detection of Dense Subhypergraphs by Low-Degree Polynomials [72.4451045270967]
Detection of a planted dense subgraph in a random graph is a fundamental statistical and computational problem.
We consider detecting the presence of a planted $G^{(r)}(n^{\gamma}, n^{-\alpha})$ subhypergraph in a $G^{(r)}(n, n^{-\beta})$ hypergraph.
Our results are already new in the graph case $r=2$, as we consider the subtle log-density regime where hardness based on average-case reductions is not known.
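For intuition in the graph case $r=2$ (our toy example, far from the subtle regimes the paper analyzes), the simplest low-degree statistic is the total edge count, a degree-1 polynomial in the adjacency entries; it separates null from planted whenever the planted part shifts the expected count by several standard deviations.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_graph(n, p, planted=None):
    """G(n, p) adjacency entries (upper triangle); optionally overlay a
    denser G(k, q) on the first k vertices (a toy planted dense subgraph)."""
    a = rng.random((n, n)) < p
    if planted is not None:
        k, q = planted
        a[:k, :k] |= rng.random((k, k)) < q
    iu = np.triu_indices(n, 1)
    return a[iu].astype(float)

def edge_count_test(edges, n, p, thresh=3.0):
    """Degree-1 polynomial statistic: centered, scaled total edge count."""
    m = n * (n - 1) / 2
    z = (edges.sum() - m * p) / np.sqrt(m * p * (1 - p))
    return z > thresh

n, p = 400, 0.05
null = edge_count_test(sample_graph(n, p), n, p)
signal = edge_count_test(sample_graph(n, p, planted=(60, 0.5)), n, p)
print(null, signal)   # typically False, True
```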
arXiv Detail & Related papers (2023-04-17T10:38:08Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Adaptive Best-of-Both-Worlds Algorithm for Heavy-Tailed Multi-Armed
Bandits [22.18577284185939]
We develop robust best-of-both-worlds algorithms for heavy-tailed multi-armed bandits.
\texttt{HTINF} simultaneously achieves the optimal regret for both stochastic and adversarial environments.
\texttt{AdaTINF} is the first algorithm that can adapt to both $\alpha$ and $\sigma$ to achieve an optimal gap-independent regret bound.
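As background on the $\alpha$ (tail index) and $\sigma$ (scale) parameters above (a standard robust-mean trick, not the \texttt{HTINF}/\texttt{AdaTINF} algorithms themselves), truncating heavy-tailed samples before averaging dramatically stabilizes the estimate:

```python
import numpy as np

rng = np.random.default_rng(2)

def truncated_mean(samples, tau):
    """Clip samples to [-tau, tau] before averaging: a classic robust-mean
    estimator for heavy tails (background; not AdaTINF itself)."""
    return float(np.mean(np.clip(samples, -tau, tau)))

# Heavy-tailed rewards: symmetrized Pareto(1.5), true mean 0 (illustrative).
def draw(n):
    return rng.pareto(1.5, n) * rng.choice([-1.0, 1.0], n)

plain = [np.mean(draw(200)) for _ in range(1_000)]
robust = [truncated_mean(draw(200), tau=5.0) for _ in range(1_000)]
print(np.std(plain), np.std(robust))   # truncated mean fluctuates far less
```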
arXiv Detail & Related papers (2022-01-28T03:53:39Z) - Improved Sample Complexity for Incremental Autonomous Exploration in
MDPs [132.88757893161699]
We learn the set of $\epsilon$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps.
DisCo is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for any cost-sensitive shortest-path problem.
arXiv Detail & Related papers (2020-12-29T14:06:09Z) - RNNs can generate bounded hierarchical languages with optimal memory [113.73133308478612]
We show that RNNs can efficiently generate bounded hierarchical languages that reflect the scaffolding of natural language syntax.
We introduce Dyck-($k$,$m$), the language of well-nested brackets (of $k$ types) and $m$-bounded nesting depth.
We prove that an RNN with $O(m \log k)$ hidden units suffices, an exponential reduction in memory, by an explicit construction.
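As a companion to the definition (our sketch; encoding brackets as signed integers is our own convention), a stack-based membership test for Dyck-($k$,$m$) makes the $m$-bounded depth explicit; note the stack holds at most $m$ entries of $\log k$ bits each, mirroring the $O(m \log k)$ memory bound.

```python
# Membership test for Dyck-(k, m): well-nested brackets of k types with
# nesting depth at most m (illustrative sketch; the signed-integer
# encoding of bracket types is our own convention).

def in_dyck(tokens, k, m):
    """tokens: sequence of ints; +t opens bracket type t, -t closes it
    (t in 1..k). Returns True iff the sequence is in Dyck-(k, m)."""
    stack = []
    for t in tokens:
        if 1 <= t <= k:                 # open bracket
            stack.append(t)
            if len(stack) > m:          # exceeds the depth bound
                return False
        elif -k <= t <= -1:             # close bracket
            if not stack or stack[-1] != -t:
                return False            # mismatched type or nothing open
            stack.pop()
        else:
            return False                # not a bracket symbol
    return not stack                    # everything must be closed

print(in_dyck([1, 2, -2, -1], k=2, m=2))          # True:  ( [ ] )
print(in_dyck([1, 2, -1, -2], k=2, m=2))          # False: crossing brackets
print(in_dyck([1, 1, 1, -1, -1, -1], k=2, m=2))   # False: depth 3 > m
```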
arXiv Detail & Related papers (2020-10-15T04:42:29Z)