From Self-Attention to Markov Models: Unveiling the Dynamics of
Generative Transformers
- URL: http://arxiv.org/abs/2402.13512v1
- Date: Wed, 21 Feb 2024 03:51:34 GMT
- Title: From Self-Attention to Markov Models: Unveiling the Dynamics of
Generative Transformers
- Authors: M. Emrullah Ildiz, Yixiao Huang, Yingcong Li, Ankit Singh Rawat and
Samet Oymak
- Abstract summary: We study learning a 1-layer self-attention model from a set of prompts and associated output data.
We first establish a precise mapping between the self-attention mechanism and Markov models.
We characterize an intriguing winner-takes-all phenomenon where the generative process implemented by self-attention collapses into sampling a limited subset of tokens.
- Score: 41.82477691012942
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern language models rely on the transformer architecture and attention
mechanism to perform language understanding and text generation. In this work,
we study learning a 1-layer self-attention model from a set of prompts and
associated output data sampled from the model. We first establish a precise
mapping between the self-attention mechanism and Markov models: Inputting a
prompt to the model samples the output token according to a context-conditioned
Markov chain (CCMC) which weights the transition matrix of a base Markov chain.
Additionally, incorporating positional encoding results in position-dependent
scaling of the transition probabilities. Building on this formalism, we develop
identifiability/coverage conditions for the prompt distribution that guarantee
consistent estimation and establish sample complexity guarantees under IID
samples. Finally, we study the problem of learning from a single output
trajectory generated from an initial prompt. We characterize an intriguing
winner-takes-all phenomenon where the generative process implemented by
self-attention collapses into sampling a limited subset of tokens due to its
non-mixing nature. This provides a mathematical explanation for the tendency of
modern LLMs to generate repetitive text. In summary, the equivalence to CCMC
provides a simple but powerful framework to study self-attention and its
properties.
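The CCMC mapping can be illustrated with a small numerical sketch. The snippet below is a toy interpretation, not the authors' code: it assumes a base transition matrix of the form P(k | q) proportional to exp(e_q^T W e_k), reweights the row of the last prompt token by how often each token appears in the context, and feeds generated tokens back into the prompt. The names (VOCAB, W, ccmc_next_token_probs, generate) and the exact parameterization are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the CCMC view of a 1-layer
# self-attention model. VOCAB, W, and the helper names are illustrative
# assumptions; the paper's exact parameterization may differ.
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 8                                   # vocabulary size
DIM = 4                                     # embedding dimension
E = rng.normal(size=(VOCAB, DIM))           # token embeddings
W = rng.normal(size=(DIM, DIM))             # combined query/key weights

def base_transition_matrix():
    """Base Markov chain: P_base(k | q) proportional to exp(e_q^T W e_k)."""
    logits = E @ W @ E.T                    # (VOCAB, VOCAB) attention logits
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    return P / P.sum(axis=1, keepdims=True)

def ccmc_next_token_probs(prompt):
    """Context-conditioned chain: the base row of the last prompt token,
    reweighted by how often each token occurs in the prompt (tokens absent
    from the context get zero mass)."""
    P = base_transition_matrix()
    counts = np.bincount(prompt, minlength=VOCAB)
    weights = counts * P[prompt[-1]]
    return weights / weights.sum()

def generate(prompt, steps=200):
    """Sample a trajectory, feeding each generated token back into the context."""
    seq = list(prompt)
    for _ in range(steps):
        probs = ccmc_next_token_probs(np.array(seq))
        seq.append(int(rng.choice(VOCAB, p=probs)))
    return seq

traj = generate(prompt=[0, 1, 2, 3, 4, 5])
print("distinct tokens in the last 50 steps:", len(set(traj[-50:])))
```

Because sampled tokens re-enter the context and increase their own weights, the simulated trajectory tends to concentrate on a small subset of tokens, a toy analogue of the winner-takes-all behavior and the repetitive generations described above.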
Related papers
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS).
MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition.
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- Dynamical mixture modeling with fast, automatic determination of Markov chains [0.0]
Variational EM efficiently identifies the number of Markov chains and dynamics of each chain without expensive model comparisons or posterior sampling.
The approach is supported by a theoretical analysis and numerical experiments, including simulated and observational data sets based on Last.fm music listening, ultramarathon running, and gene expression.
arXiv Detail & Related papers (2024-06-07T05:43:11Z)
- σ-GPTs: A New Approach to Autoregressive Models [19.84252724050016]
We show that by simply adding a positional encoding for the output, this order can be modulated on-the-fly per-sample.
We evaluate our method across various domains, including language modeling, path-solving, and aircraft vertical rate prediction.
arXiv Detail & Related papers (2024-04-15T08:22:47Z)
- The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains [28.41876902994335]
We introduce a simple Markov Chain sequence modeling task to study how this in-context learning (ICL) capability emerges.
Transformers trained on this task form statistical induction heads which compute accurate next-token probabilities.
We show how successful learning results from the interaction between the transformer's layers, and uncover evidence that the presence of the simpler unigram solution may delay formation of the final bigram solution.
arXiv Detail & Related papers (2024-02-16T18:28:36Z)
- Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space.
We demonstrate the broad applicability of this approach by adding it to both basic data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z)
- Leveraging Instance Features for Label Aggregation in Programmatic Weak Supervision [75.1860418333995]
Programmatic Weak Supervision (PWS) has emerged as a widespread paradigm to synthesize training labels efficiently.
The core component of PWS is the label model, which infers true labels by aggregating the outputs of multiple noisy supervision sources as labeling functions.
Existing statistical label models typically rely only on the outputs of the LFs, ignoring the instance features when modeling the underlying generative process.
arXiv Detail & Related papers (2022-10-06T07:28:53Z)
- Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept [56.46135010588918]
We prove that the widely used class of RNN-Transducer models and segmental models (direct HMM) are equivalent.
It is shown that blank probabilities translate into segment length probabilities and vice versa.
arXiv Detail & Related papers (2021-04-13T11:20:48Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
VAEs tend to ignore latent variables when paired with a strong auto-regressive decoder.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)