Moving Beyond Next-Token Prediction: Transformers are Context-Sensitive Language Generators
- URL: http://arxiv.org/abs/2504.10845v1
- Date: Tue, 15 Apr 2025 04:06:27 GMT
- Title: Moving Beyond Next-Token Prediction: Transformers are Context-Sensitive Language Generators
- Authors: Phill Kyu Rhee
- Abstract summary: Large Language Models (LLMs) powered by Transformers have demonstrated human-like intelligence capabilities. This paper presents a novel framework for interpreting LLMs as probabilistic generators of left context-sensitive languages (CSLs).
- Score: 0.40792653193642503
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Language Models (LLMs), powered by Transformers, have demonstrated human-like intelligence capabilities, yet their underlying mechanisms remain poorly understood. This paper presents a novel framework for interpreting LLMs as probabilistic generators of left context-sensitive languages (CSLs). We hypothesize that Transformers can be effectively decomposed into three fundamental components: context windows, attention mechanisms, and autoregressive generation frameworks. This decomposition allows for the development of more flexible and interpretable computational models, moving beyond the traditional view of attention and autoregression as inseparable processes. We argue that next-token predictions can be understood as probabilistic, dynamic approximations of left CSL production rules, providing an intuitive explanation for how simple token predictions can yield human-like intelligent outputs. Given that every CSL can be generated by a left context-sensitive grammar (Penttonen, 1974), we conclude that Transformers stochastically approximate CSLs, which are widely recognized as models of human-like intelligence. This interpretation bridges the gap between Formal Language Theory and the observed generative power of Transformers, laying a foundation for future advancements in generative AI theory and applications. Our novel perspective on Transformer architectures will foster a deeper understanding of LLMs and their future potential.
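The decomposition described in the abstract (context window, attention mechanism, and autoregressive generation, with next-token prediction read as a stochastic left-context production rule) can be made concrete with a minimal, self-contained sketch. The toy vocabulary, random weights, and function names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and dimensions; the random weights below stand in for a
# trained Transformer purely to illustrate the decomposition.
VOCAB = ["<bos>", "the", "cat", "sat", "on", "mat", "."]
V, D, CONTEXT_WINDOW = len(VOCAB), 16, 8

E   = rng.normal(size=(V, D))   # token embeddings
W_q = rng.normal(size=(D, D))   # query projection
W_k = rng.normal(size=(D, D))   # key projection
W_v = rng.normal(size=(D, D))   # value projection
W_o = rng.normal(size=(D, V))   # output head

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def next_token_distribution(context_ids):
    """Attention component: map the visible left context to P(next | context),
    i.e. one stochastic left-context 'production rule'."""
    ctx = np.asarray(context_ids)[-CONTEXT_WINDOW:]   # context-window component
    X = E[ctx]                                        # (T, D) embedded context
    q = X[-1] @ W_q                                   # query from the last position
    scores = (X @ W_k) @ q / np.sqrt(D)               # attention scores over the left context
    attn = softmax(scores)                            # attention weights
    summary = attn @ (X @ W_v)                        # weighted summary of the context
    return softmax(summary @ W_o)                     # distribution over the next token

def generate(prompt_ids, n_steps=5):
    """Autoregressive component: repeatedly sample the production rule and
    append the result, which is the loop the abstract decomposes."""
    ids = list(prompt_ids)
    for _ in range(n_steps):
        p = next_token_distribution(ids)
        ids.append(int(rng.choice(V, p=p)))
    return [VOCAB[i] for i in ids]

print(generate([0, 1, 2]))  # e.g. ['<bos>', 'the', 'cat', ...]
```

In this reading, each call to next_token_distribution plays the role of one probabilistic production rule conditioned on the left context, while generate is the autoregressive framework that chains such rules into a derivation.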
Related papers
- Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning [31.632816425798108]
Tokenization is a necessary component within the current architecture of many language models. We discuss how tokens and pretraining can act as a backdoor for bias and other unwanted content. We relay evidence that the tokenization algorithm's objective function impacts the large language model's cognition.
arXiv Detail & Related papers (2024-12-14T18:18:52Z) - Mechanisms of Symbol Processing for In-Context Learning in Transformer Networks [78.54913566111198]
Large Language Models (LLMs) have demonstrated impressive abilities in symbol processing through in-context learning (ICL).
We seek to understand the mechanisms that can enable robust symbol processing in transformer networks.
We develop a high-level language, PSL, that allows us to write symbolic programs to do complex, abstract symbol processing.
arXiv Detail & Related papers (2024-10-23T01:38:10Z) - Dynamic Universal Approximation Theory: The Basic Theory for Transformer-based Large Language Models [9.487731634351787]
Large-scale Transformer networks have quickly become the leading approach for advancing natural language processing algorithms. This paper explores the theoretical foundations of large language models (LLMs). It offers a theoretical backdrop, shedding light on the mechanisms that underpin these advancements.
arXiv Detail & Related papers (2024-07-01T04:29:35Z) - On the Representational Capacity of Neural Language Models with Chain-of-Thought Reasoning [87.73401758641089]
Chain-of-thought (CoT) reasoning has improved the performance of modern language models (LMs). We show that LMs can represent the same family of distributions over strings as probabilistic Turing machines.
arXiv Detail & Related papers (2024-06-20T10:59:02Z) - Explaining Text Similarity in Transformer Models [52.571158418102584]
Recent advances in explainable AI have made it possible to mitigate the opacity of Transformer-based similarity models by leveraging improved explanations for Transformers.
We use BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, to investigate which feature interactions drive similarity in NLP models.
Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
arXiv Detail & Related papers (2024-05-10T17:11:31Z) - Transformers Can Represent $n$-gram Language Models [56.06361029539347]
We focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models.
We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM (a toy illustration of the hard-attention case appears after this list).
arXiv Detail & Related papers (2024-04-23T12:51:37Z) - Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation [52.270712965271656]
We propose a new model of contextual word representation, not from a neural perspective, but from a purely syntactic and probabilistic perspective.
We find that the graph of our model resembles transformers, with correspondences between dependencies and self-attention.
Experiments show that our model performs competitively to transformers on small to medium sized datasets.
arXiv Detail & Related papers (2023-11-26T06:56:02Z) - Let Models Speak Ciphers: Multiagent Debate through Embeddings [84.20336971784495]
We introduce CIPHER (Communicative Inter-Model Protocol Through Embedding Representation) to address the information loss that arises when LLMs must sample natural-language tokens to communicate with one another.
By deviating from natural language, CIPHER offers an advantage of encoding a broader spectrum of information without any modification to the model weights.
This showcases the superiority and robustness of embeddings as an alternative "language" for communication among LLMs.
arXiv Detail & Related papers (2023-10-10T03:06:38Z) - On the Ability and Limitations of Transformers to Recognize Formal Languages [9.12267978757844]
We provide a construction of Transformers for a subclass of counter languages.
We find that Transformers do well on this subclass, and their learned mechanism strongly correlates with our construction.
Perhaps surprisingly, and in contrast to LSTMs, Transformers do well only on a subset of regular languages, with performance degrading as the languages become more complex.
arXiv Detail & Related papers (2020-09-23T17:21:33Z)
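One related result above, Transformers Can Represent $n$-gram Language Models, is easy to illustrate: with hard attention, a head can deterministically copy each of the previous n-1 tokens, after which next-token prediction reduces to an n-gram table lookup. The sketch below is a toy under that reading; the trigram table and all names are made-up assumptions, not the paper's actual construction:

```python
import numpy as np

# Made-up trigram conditionals P(next | two previous tokens), for illustration only.
TRIGRAM = {
    ("the", "cat"): {"sat": 0.9, ".": 0.1},
    ("cat", "sat"): {"on": 1.0},
    ("sat", "on"):  {"the": 1.0},
}

def hard_head(tokens, offset):
    """One hard-attention head: put all attention mass on position t-(offset-1)
    (an argmax over scores that peak there) and copy that token exactly."""
    t = len(tokens) - 1
    scores = np.array([1.0 if i == t - offset + 1 else 0.0
                       for i in range(len(tokens))])
    return tokens[int(np.argmax(scores))]

def ngram_next_distribution(tokens, n=3):
    # One head per offset recovers the (n-1)-token left context exactly ...
    context = tuple(hard_head(tokens, k) for k in range(n - 1, 0, -1))
    # ... and the output layer then acts as an n-gram table lookup.
    return TRIGRAM.get(context, {})

print(ngram_next_distribution(["the", "cat", "sat"]))  # {'on': 1.0}
```

Because hard attention copies tokens exactly rather than averaging them, the lookup reproduces the n-gram conditional exactly, which is the intuition behind the exact-representation claim.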