Most Likely Sequence Generation for $n$-Grams, Transformers, HMMs, and Markov Chains, by Using Rollout Algorithms
- URL: http://arxiv.org/abs/2403.15465v1
- Date: Tue, 19 Mar 2024 19:58:46 GMT
- Title: Most Likely Sequence Generation for $n$-Grams, Transformers, HMMs, and Markov Chains, by Using Rollout Algorithms
- Authors: Yuchao Li, Dimitri Bertsekas
- Abstract summary: We consider a transformer with an $n$-gram structure, such as the one underlying ChatGPT.
The transformer provides next word probabilities, which can be used to generate word sequences.
We consider methods for computing word sequences that are highly likely, based on these probabilities.
- Score: 3.014160809522789
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper we consider a transformer with an $n$-gram structure, such as the one underlying ChatGPT. The transformer provides next word probabilities, which can be used to generate word sequences. We consider methods for computing word sequences that are highly likely, based on these probabilities. Computing the optimal (i.e., most likely) word sequence starting with a given initial state is an intractable problem, so we propose methods to compute highly likely sequences of $N$ words in time that is a low order polynomial in $N$ and in the vocabulary size of the $n$-gram. These methods are based on the rollout approach from approximate dynamic programming, a form of single policy iteration, which can improve the performance of any given heuristic policy. In our case we use a greedy heuristic that generates as next word one that has the highest probability. We show with analysis, examples, and computational experimentation that our methods are capable of generating highly likely sequences with a modest increase in computation over the greedy heuristic. While our analysis and experiments are focused on Markov chains of the type arising in transformer and ChatGPT-like models, our methods apply to general finite-state Markov chains, and related inference applications of Hidden Markov Models (HMM), where Viterbi decoding is used extensively.
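To make the rollout scheme concrete, here is a minimal sketch under simplifying assumptions: a first-order Markov chain (transition matrix `P`) stands in for the $n$-gram state and for the transformer's next-word probabilities, and all function and variable names are illustrative rather than taken from the paper.

```python
import numpy as np

def greedy_logprob(P, state, steps):
    """Log-probability of the sequence the greedy heuristic would
    generate from `state` for `steps` more words (the base policy)."""
    total = 0.0
    for _ in range(steps):
        nxt = int(np.argmax(P[state]))   # greedy: take the most likely word
        total += np.log(P[state, nxt])
        state = nxt
    return total

def rollout_sequence(P, start, N):
    """One-step-lookahead rollout with the greedy heuristic as base policy.

    P     : (V, V) row-stochastic matrix, P[i, j] = Prob(next word j | state i)
    start : initial state index
    N     : number of words to generate
    """
    seq, state = [], start
    for t in range(N):
        best_w, best_score = None, -np.inf
        for w in range(P.shape[1]):      # score every candidate next word
            if P[state, w] == 0.0:
                continue
            # candidate's log-prob plus log-prob of the greedy completion
            score = np.log(P[state, w]) + greedy_logprob(P, w, N - t - 1)
            if score > best_score:
                best_w, best_score = w, score
        seq.append(best_w)
        state = best_w                   # commit one word, then repeat
    return seq
```

By the cost-improvement property of rollout, the returned sequence is at least as likely as the one the greedy heuristic produces on its own, and the added work per generated word stays a low-order polynomial in $N$ and the vocabulary size, consistent with the abstract's complexity claim.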
Related papers
- Exact Sequence Classification with Hardmax Transformers [0.0]
We prove that hardmax attention transformers perfectly classify datasets of $N$ labeled sequences in $\mathbb{R}^d$, $d \geq 2$.
Specifically, given $N$ sequences with an arbitrary but finite length in $\mathbb{R}^d$, we construct a transformer with $\mathcal{O}(N)$ blocks and $\mathcal{O}(Nd)$ parameters perfectly classifying this dataset.
arXiv Detail & Related papers (2025-02-04T12:31:00Z) - SoS Certificates for Sparse Singular Values and Their Applications: Robust Statistics, Subspace Distortion, and More [37.208622097149714]
We give a new family of polynomial-time algorithms which can certify upper bounds on the maximum of $\|Mu\|$.
Our certification algorithm makes essential use of the Sum-of-Squares hierarchy.
arXiv Detail & Related papers (2024-12-30T18:59:46Z) - QWO: Speeding Up Permutation-Based Causal Discovery in LiGAMs [20.661343069864888]
We introduce QWO, a novel approach that significantly enhances the efficiency of computing $\mathcal{G}^{\pi}$ for a given permutation $\pi$.
QWO has a speed-up of $O(n^2)$ ($n$ is the number of variables) compared to the state-of-the-art BIC-based method, making it highly scalable.
arXiv Detail & Related papers (2024-10-30T16:10:46Z) - GEC-DePenD: Non-Autoregressive Grammatical Error Correction with Decoupled Permutation and Decoding [52.14832976759585]
Grammatical error correction (GEC) is an important NLP task that is usually solved with autoregressive sequence-to-sequence models.
We propose a novel non-autoregressive approach to GEC that decouples the architecture into a permutation network and a decoder network.
We show that the resulting network improves over previously known non-autoregressive methods for GEC.
arXiv Detail & Related papers (2023-11-14T14:24:36Z) - Transformers Learn Shortcuts to Automata [52.015990420075944]
We find that a low-depth Transformer can represent the computations of any finite-state automaton.
We show that a Transformer with $O(\log T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$.
We further investigate the brittleness of these solutions and propose potential mitigations.
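The $O(\log T)$-layer claim reflects that automaton updates compose associatively, so all prefix states can be obtained by a logarithmic-depth parallel scan over transition functions. A minimal sketch of that mechanism (an illustration with invented names, not the paper's construction):

```python
def compose(f, g):
    """Compose transition functions stored as tuples: (f o g)[s] = f[g[s]]."""
    return tuple(f[s] for s in g)

def prefix_states(delta, inputs, start=0):
    """Automaton state after each prefix of `inputs`, via a parallel scan.

    delta[a] is the transition function for symbol a, as a tuple over states.
    Because composition is associative, O(log T) rounds of pairwise
    composition suffice, instead of T sequential steps.
    """
    funcs = [delta[a] for a in inputs]
    n, step = len(funcs), 1
    while step < n:  # O(log T) rounds
        funcs = [funcs[i] if i < step
                 else compose(funcs[i], funcs[i - step])
                 for i in range(n)]
        step *= 2
    return [g[start] for g in funcs]  # state reached after each prefix

# Example: a two-state parity automaton where symbol 1 flips the state.
# prefix_states({0: (0, 1), 1: (1, 0)}, [1, 0, 1, 1]) returns [1, 1, 0, 1].
```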
arXiv Detail & Related papers (2022-10-19T17:45:48Z) - Recurrently Predicting Hypergraphs [30.092688729343678]
A problem arises from the number of possible multi-way relationships, or hyperedges, which scales as $\mathcal{O}(2^n)$ for a set of $n$ elements.
We propose a recurrent hypergraph neural network that predicts the incidence matrix by iteratively refining an initial guess of the solution.
arXiv Detail & Related papers (2021-06-26T01:12:41Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z) - Determinantal Beam Search [75.84501052642361]
Beam search is a go-to strategy for decoding neural sequence models.
In use-cases that call for multiple solutions, a diverse or representative set is often desired.
By posing iterations in beam search as a series of subdeterminant maximization problems, we can turn the algorithm into a diverse subset selection process.
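As a rough illustration of the subdeterminant idea (a toy sketch with invented names, not the paper's decoding algorithm): greedily growing an index set that maximizes the determinant of a positive semidefinite similarity kernel favors diverse selections.

```python
import numpy as np

def greedy_subdeterminant(K, k):
    """Greedily grow a size-k index set S maximizing det(K[S, S]).

    K : (n, n) positive semidefinite similarity kernel over candidates.
    Larger determinants correspond to less redundant subsets, which is
    the effect the determinantal view of beam search exploits.
    """
    n = K.shape[0]
    selected = []
    for _ in range(k):
        best_i, best_det = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            S = selected + [i]
            d = np.linalg.det(K[np.ix_(S, S)])  # subdeterminant of candidate set
            if d > best_det:
                best_i, best_det = i, d
        selected.append(best_i)
    return selected
```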
arXiv Detail & Related papers (2021-06-14T13:01:46Z) - Second-Order Information in Non-Convex Stochastic Optimization: Power and Limitations [54.42518331209581]
We design an algorithm which finds an $\epsilon$-approximate stationary point (with $\|\nabla F(x)\| \le \epsilon$) using stochastic gradient and Hessian-vector products, and study the complexity of finding $(\epsilon,\gamma)$-approximate second-order stationary points.
Our lower bounds here are novel even in the noiseless case.
arXiv Detail & Related papers (2020-06-24T04:41:43Z) - $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.