Limitations of Autoregressive Models and Their Alternatives
- URL: http://arxiv.org/abs/2010.11939v3
- Date: Mon, 31 May 2021 02:09:15 GMT
- Title: Limitations of Autoregressive Models and Their Alternatives
- Authors: Chu-Cheng Lin and Aaron Jaech and Xin Li and Matthew R. Gormley and Jason Eisner
- Abstract summary: The limitations of standard autoregressive language models apply no matter how much computation and data are used to train the model.
Energy-based models (which give up efficient sampling) and latent-variable autoregressive models (which give up efficient scoring of a given string) are powerful enough to escape these limitations.
- Score: 31.827580420643606
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Standard autoregressive language models perform only polynomial-time
computation to compute the probability of the next symbol. While this is
attractive, it means they cannot model distributions whose next-symbol
probability is hard to compute. Indeed, they cannot even model them well enough
to solve associated easy decision problems for which an engineer might want to
consult a language model. These limitations apply no matter how much
computation and data are used to train the model, unless the model is given
access to oracle parameters that grow superpolynomially in sequence length.
Thus, simply training larger autoregressive language models is not a panacea
for NLP. Alternatives include energy-based models (which give up efficient
sampling) and latent-variable autoregressive models (which give up efficient
scoring of a given string). Both are powerful enough to escape the above
limitations.
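The trade-off named in the abstract can be illustrated with a toy energy-based model. The following is a minimal sketch, not the authors' construction: the `energy` function, the string length `n`, and all helper names are hypothetical. Scoring a complete string up to normalization takes one call to `energy`, whereas the generic route to a next-symbol probability sums over every completion of the prefix, which is exponential in the remaining length. This particular toy energy happens to admit an efficient shortcut; the paper's results concern distributions for which no polynomial-time shortcut is expected.

```python
import itertools
import math


def energy(x):
    # Hypothetical toy energy: one unit of penalty per pair of equal adjacent bits.
    return sum(1.0 for a, b in zip(x, x[1:]) if a == b)


def unnormalized(x):
    # Unnormalized score of a complete string: a single energy evaluation.
    return math.exp(-energy(x))


def prefix_mass(prefix, n):
    # Total unnormalized mass of all length-n strings that start with `prefix`.
    # Brute force: enumerates 2 ** (n - len(prefix)) completions.
    rest = n - len(prefix)
    return sum(
        unnormalized(tuple(prefix) + tail)
        for tail in itertools.product((0, 1), repeat=rest)
    )


def next_symbol_prob(prefix, symbol, n):
    # p(next = symbol | prefix) as a ratio of two prefix masses.
    return prefix_mass(tuple(prefix) + (symbol,), n) / prefix_mass(prefix, n)


n = 8
x = (0, 1, 1, 0, 1, 0, 0, 1)

# Partition function: mass of the empty prefix.
Z = prefix_mass((), n)

# Chain rule: the product of next-symbol probabilities recovers p(x).
chain = 1.0
for t in range(n):
    chain *= next_symbol_prob(x[:t], x[t], n)

print(chain, unnormalized(x) / Z)
```

The two printed values should agree up to floating-point error, which shows that the globally normalized model and its autoregressive factorization describe the same distribution; only the cost of obtaining each factor differs.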
Related papers
- XLM: A Python package for non-autoregressive language models [22.95539075474856]
In recent years, there has been a resurgence of interest in non-autoregressive text generation in the context of general language modeling. We present the XLM Python package, which is designed to make implementing small non-autoregressive language models faster.
arXiv Detail & Related papers (2025-12-18T21:05:10Z)
- Pretraining Language Models to Ponder in Continuous Space [50.52734567589996]
We introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations.
arXiv Detail & Related papers (2025-05-27T03:47:33Z)
- Model Stealing for Any Low-Rank Language Model [25.16701867917684]
We build a theoretical understanding of stealing language models by studying a simple and mathematically tractable setting.
Our main result is an efficient algorithm in the conditional query model, for learning any low-rank distribution.
This is an interesting example where, at least theoretically, allowing a machine learning model to solve more complex problems at inference time can lead to drastic improvements in its performance.
arXiv Detail & Related papers (2024-11-12T04:25:31Z)
- Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization [22.90653167145603]
We introduce HyperCloning, a method that can expand the parameters of a pre-trained language model to those of a larger model with increased hidden dimensions.
As a result, the larger model already inherits the predictive power and accuracy of the smaller model before the training starts.
arXiv Detail & Related papers (2024-09-19T16:50:26Z)
- Beyond Closure Models: Learning Chaotic-Systems via Physics-Informed Neural Operators [78.64101336150419]
Predicting the long-term behavior of chaotic systems is crucial for various applications such as climate modeling.
An alternative approach to such a full-resolved simulation is using a coarse grid and then correcting its errors through a closure model.
We propose an alternative end-to-end learning approach using a physics-informed neural operator (PINO) that overcomes this limitation.
arXiv Detail & Related papers (2024-08-09T17:05:45Z)
- Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z)
- Induced Model Matching: Restricted Models Help Train Full-Featured Models [1.4963011898406866]
We consider scenarios where a very accurate (often small) predictive model using restricted features is available when training a full-featured (often larger) model.
How can the restricted model be useful to the full model?
We introduce a methodology called Induced Model Matching (IMM).
IMM aligns the context-restricted, or induced, version of the large model with the restricted model.
arXiv Detail & Related papers (2024-02-19T20:21:09Z)
- A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints [87.08677547257733]
Neuro-symbolic AI bridges the gap between purely symbolic and neural approaches to learning.
We show how to maximize the likelihood of a symbolic constraint w.r.t. the neural network's output distribution.
We also evaluate our approach on Sudoku and shortest-path prediction cast as autoregressive generation.
arXiv Detail & Related papers (2023-12-06T20:58:07Z)
- ArthModel: Enhance Arithmetic Skills to Large Language Model [0.0]
This work provides different ways of thinking, training and using a language model.
The code and models will be released at https://www.eteced.com/eteced/arithmetic_finetuning_v1.
arXiv Detail & Related papers (2023-11-30T15:06:50Z)
- Meaning Representations from Trajectories in Autoregressive Models [106.63181745054571]
We propose to extract meaning representations from autoregressive language models by considering the distribution of all possible trajectories extending an input text.
This strategy is prompt-free, does not require fine-tuning, and is applicable to any pre-trained autoregressive model.
We empirically show that the representations obtained from large models align well with human annotations, outperform other zero-shot and prompt-free methods on semantic similarity tasks, and can be used to solve more complex entailment and containment tasks that standard embeddings cannot handle.
arXiv Detail & Related papers (2023-10-23T04:35:58Z)
- Oracle Inequalities for Model Selection in Offline Reinforcement Learning [105.74139523696284]
We study the problem of model selection in offline RL with value function approximation.
We propose the first model selection algorithm for offline RL that achieves minimax rate-optimal inequalities up to logarithmic factors.
We conclude with several numerical simulations showing it is capable of reliably selecting a good model class.
arXiv Detail & Related papers (2022-11-03T17:32:34Z)
- Argmax Flows and Multinomial Diffusion: Towards Non-Autoregressive Language Models [76.22217735434661]
This paper introduces two new classes of generative models for categorical data: Argmax Flows and Multinomial Diffusion.
We demonstrate that our models perform competitively on language modelling and modelling of image segmentation maps.
arXiv Detail & Related papers (2021-02-10T11:04:17Z)
- Interventions and Counterfactuals in Tractable Probabilistic Models: Limitations of Contemporary Transformations [12.47276164048813]
We show that when transforming SPNs to a causal graph, interventional reasoning reduces to computing marginal distributions.
We first provide an algorithm for constructing a causal graph from a PSDD, which introduces augmented variables.
arXiv Detail & Related papers (2020-01-29T15:45:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.