StackTrans: From Large Language Model to Large Pushdown Automata Model
- URL: http://arxiv.org/abs/2507.15343v2
- Date: Mon, 04 Aug 2025 10:12:31 GMT
- Title: StackTrans: From Large Language Model to Large Pushdown Automata Model
- Authors: Kechi Zhang, Ge Li, Jia Li, Huangzhao Zhang, Yihong Dong, Jia Li, Jingjing Xu, Zhi Jin
- Abstract summary: The Transformer architecture has emerged as a landmark advancement within the broad field of artificial intelligence. We propose StackTrans to address a core limitation of large language models (LLMs): their inability to capture formal languages in the Chomsky hierarchy. Unlike previous approaches that modify the attention computation, StackTrans explicitly incorporates hidden state stacks between Transformer layers.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer architecture has emerged as a landmark advancement within the broad field of artificial intelligence, effectively catalyzing the advent of large language models (LLMs). However, despite its remarkable capabilities and the substantial progress it has facilitated, the Transformer architecture still has some limitations. One such intrinsic limitation is its inability to effectively capture certain levels of the Chomsky hierarchy, such as regular languages or deterministic context-free grammars. Drawing inspiration from pushdown automata, which efficiently recognize deterministic context-free languages using a stack, we propose StackTrans to address this issue within LLMs. Unlike previous approaches that modify the attention computation, StackTrans explicitly incorporates hidden state stacks between Transformer layers. This design maintains compatibility with existing frameworks like flash-attention. Specifically, our design features stack operations -- such as pushing and popping hidden states -- that are differentiable and can be learned in an end-to-end manner. Our comprehensive evaluation spans benchmarks for both the Chomsky hierarchy and large-scale natural language. Across these diverse tasks, StackTrans consistently outperforms standard Transformer models and other baselines. We have successfully scaled StackTrans from 360M to 7B parameters. In particular, our from-scratch pretrained model StackTrans-360M outperforms several larger open-source LLMs with 2-3x more parameters, showcasing its superior efficiency and reasoning capability.
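The abstract does not spell out the stack mechanics. A minimal sketch of a differentiable hidden-state stack inserted between Transformer layers might look like the following; the gating scheme, stack depth, and how the stack top re-enters the residual stream are assumptions for illustration, not the paper's actual design:

```python
import torch
import torch.nn as nn

class SoftStack(nn.Module):
    """Sketch of a differentiable hidden-state stack placed between
    Transformer layers. Push/pop/no-op are blended by learned gate
    probabilities, so the structure is trainable end to end."""

    def __init__(self, d_model: int, depth: int = 8):
        super().__init__()
        self.depth = depth
        self.gate = nn.Linear(d_model, 3)  # logits for (push, pop, no-op)

    def forward(self, h: torch.Tensor, stack: torch.Tensor):
        # h:     (batch, d_model)         hidden state leaving a layer
        # stack: (batch, depth, d_model)  soft stack, index 0 is the top
        p_push, p_pop, p_noop = self.gate(h).softmax(dim=-1).chunk(3, dim=-1)

        pushed = torch.cat([h.unsqueeze(1), stack[:, :-1]], dim=1)                 # new top
        popped = torch.cat([stack[:, 1:], torch.zeros_like(stack[:, :1])], dim=1)  # drop top

        # Convex combination of the three actions keeps everything differentiable.
        new_stack = (p_push.unsqueeze(1) * pushed
                     + p_pop.unsqueeze(1) * popped
                     + p_noop.unsqueeze(1) * stack)
        # Feed the (new) stack top back into the residual stream (assumed wiring).
        return h + new_stack[:, 0], new_stack
```

Because such a stack sits between layers rather than inside the attention computation, the attention kernel itself is untouched, which is consistent with the abstract's claim of flash-attention compatibility.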
Related papers
- Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. The paper first attributes the inefficiency of Transformers to the attention sink phenomenon. We replace softmax with the sigmoid function and combine balanced ALiBi with Rotary Position Embedding for efficient information compression and retention.
arXiv Detail & Related papers (2025-02-26T05:31:44Z)
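As a rough illustration of the mechanism the SWAT summary describes, here is a minimal causal sliding-window attention scored with sigmoid instead of softmax; the balanced-ALiBi and RoPE components are omitted, and the window size is an arbitrary placeholder:

```python
import torch

def sliding_window_sigmoid_attention(q, k, v, window: int = 256):
    """Causal sliding-window attention with sigmoid scoring (a sketch,
    not the paper's implementation)."""
    # q, k, v: (batch, seq, d)
    b, n, d = q.shape
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (batch, seq, seq)
    i = torch.arange(n, device=q.device)
    # Each query may attend only to itself and the previous `window - 1` keys.
    band = (i[:, None] >= i[None, :]) & (i[:, None] - i[None, :] < window)
    # Sigmoid weights keys independently, avoiding softmax's winner-take-all
    # competition that concentrates mass on attention sinks.
    weights = torch.sigmoid(scores).masked_fill(~band, 0.0)
    return weights @ v
```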
- A Transformer with Stack Attention [84.18399019794036]
We propose augmenting transformer-based language models with a differentiable, stack-based attention mechanism.
Our stack-based attention mechanism can be incorporated into any transformer-based language model and adds a level of interpretability to the model.
We show that the addition of our stack-based attention mechanism enables the transformer to model some, but not all, deterministic context-free languages.
arXiv Detail & Related papers (2024-05-07T17:47:57Z)
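A toy sketch of what attention computed over a differentiable stack could look like appears below; it only pushes (a faithful version would also learn pop gates, as in the StackTrans sketch above) and is an illustration, not this paper's exact mechanism:

```python
import torch
import torch.nn as nn

class StackAttentionHead(nn.Module):
    """Toy sketch: attention over a soft stack instead of the full history."""

    def __init__(self, d_model: int, depth: int = 16):
        super().__init__()
        self.depth = depth
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        b, n, d = x.shape
        stack = x.new_zeros(b, self.depth, d)
        outs = []
        for t in range(n):
            # Push x_t: shift the stack down, the new element becomes the top.
            stack = torch.cat([x[:, t:t + 1], stack[:, :-1]], dim=1)
            att = (self.q(x[:, t:t + 1]) @ self.k(stack).transpose(-2, -1)) / d ** 0.5
            outs.append(att.softmax(dim=-1) @ self.v(stack))
        return torch.cat(outs, dim=1)  # (batch, seq, d_model)
```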
- Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z)
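The copying benchmark behind this comparison is easy to state in miniature. The sketch below constructs one instance and a per-token accuracy metric; the prompt format (string, separator, then the expected verbatim copy) is an assumption for illustration:

```python
import random

def make_copy_instance(vocab_size: int = 50, length: int = 20, seed: int = 0):
    """One copying-task instance: a random token string, a separator,
    and the expected verbatim continuation."""
    rng = random.Random(seed)
    src = [rng.randrange(vocab_size) for _ in range(length)]
    sep = vocab_size               # dedicated separator token id
    return src + [sep], src        # (prompt, target)

def copy_accuracy(pred: list, target: list) -> float:
    """Per-token exact-match accuracy of the copied continuation."""
    hits = sum(p == t for p, t in zip(pred, target))
    return hits / len(target) if target else 1.0
```

A transformer can attend directly back to the string in context, while a fixed-size recurrent state must compress it, which is the intuition behind the paper's finding.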
- Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns [17.144569385099462]
We show that stack attention is analogous to standard attention, but with a latent model of syntax that requires no syntactic supervision.
We show that stack attention is more effective at natural language modeling under a constrained parameter budget, and we include results on machine translation.
arXiv Detail & Related papers (2023-10-03T02:18:06Z)
- Learning Multiscale Transformer Models for Sequence Generation [33.73729074207944]
We build a multiscale Transformer model by establishing relationships among scales based on word-boundary information and phrase-level prior knowledge.
Notably, it yielded consistent performance gains over the strong baseline on several test sets without sacrificing efficiency.
arXiv Detail & Related papers (2022-06-19T07:28:54Z)
- Designing Effective Sparse Expert Models [45.21279650229869]
Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy-efficient path to larger and more capable language models.
But advancing the state of the art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning.
We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer.
arXiv Detail & Related papers (2022-02-17T21:39:10Z)
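For context, a minimal top-1 (Switch-style) routed feed-forward layer looks roughly like the sketch below; capacity limits, the load-balancing auxiliary loss, and the stability and fine-tuning fixes this paper actually studies are all omitted:

```python
import torch
import torch.nn as nn

class SwitchFFN(nn.Module):
    """Minimal top-1 mixture-of-experts feed-forward layer (a sketch)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Each token visits exactly one expert, so
        # per-token compute stays flat as parameters grow with expert count.
        gates = self.router(x).softmax(dim=-1)     # (tokens, n_experts)
        top_gate, top_idx = gates.max(dim=-1)      # top-1 routing decision
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):  # illustrative loop; real
            sel = top_idx == e                     # kernels batch the dispatch
            if sel.any():
                # Scaling by the gate keeps the routing decision differentiable.
                out[sel] = top_gate[sel].unsqueeze(1) * expert(x[sel])
        return out
```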
- SML: a new Semantic Embedding Alignment Transformer for efficient cross-lingual Natural Language Inference [71.57324258813674]
The ability of Transformers to perform a variety of tasks with precision, such as question answering, Natural Language Inference (NLI), or summarization, has enabled them to be ranked among the best paradigms for addressing these kinds of tasks at present.
NLI is one of the best scenarios for testing these architectures, due to the knowledge required to understand complex sentences and establish a relation between a hypothesis and a premise.
In this paper, we propose a new architecture, the siamese multilingual transformer (SML), to efficiently align multilingual embeddings for Natural Language Inference.
arXiv Detail & Related papers (2021-03-17T13:23:53Z)
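At a high level, the siamese setup means one shared encoder applied to both sentences, with their pooled embeddings compared directly. The sketch below assumes mean pooling and cosine similarity, which are illustrative choices rather than the paper's specifics:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseAligner(nn.Module):
    """Sketch of a siamese setup for cross-lingual embedding alignment."""

    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # the SAME weights process both inputs

    def forward(self, premise: torch.Tensor, hypothesis: torch.Tensor) -> torch.Tensor:
        # Encoder output assumed to be (batch, seq, d_model); mean-pool it.
        u = self.encoder(premise).mean(dim=1)
        v = self.encoder(hypothesis).mean(dim=1)
        # Cosine similarity; an NLI head could instead classify [u; v; |u - v|].
        return F.cosine_similarity(u, v, dim=-1)
```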
"Hierarchical Accumulation" encodes parse tree structures into self-attention at constant time complexity.
Our approach outperforms SOTA methods on four IWSLT translation tasks and the WMT'14 English-German translation task.
arXiv Detail & Related papers (2020-02-19T08:17:00Z)