pLSTM: parallelizable Linear Source Transition Mark networks
- URL: http://arxiv.org/abs/2506.11997v1
- Date: Fri, 13 Jun 2025 17:51:37 GMT
- Title: pLSTM: parallelizable Linear Source Transition Mark networks
- Authors: Korbinian Pöppel, Richard Freinschlag, Thomas Schmied, Wei Lin, Sepp Hochreiter
- Abstract summary: We introduce parallelizable Linear Source Transition Mark networks (pLSTMs) using Source, Transition, and Mark gates. pLSTMs tackle the vanishing/exploding activation/gradient problem for long distances in DAGs via two distinct modes. We demonstrate that pLSTMs generalize well to larger image sizes, whereas Transformers struggle to extrapolate.
- Score: 10.620405837091022
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern recurrent architectures, such as xLSTM and Mamba, have recently challenged the Transformer in language modeling. However, their structure constrains their applicability to sequences only or requires processing multi-dimensional data structures, such as images or molecular graphs, in a pre-defined sequential order. In contrast, Multi-Dimensional RNNs (MDRNNs) are well suited for data with a higher level structure, like 2D grids, trees, and directed acyclic graphs (DAGs). In this work, we extend the notion of multi-dimensionality to linear RNNs. We introduce parallelizable Linear Source Transition Mark networks (pLSTMs) using Source, Transition, and Mark gates that act on the line graph of a general DAG. This enables parallelization in analogy to parallel associative scans and the chunkwise-recurrent form of sequential linear RNNs, but for DAGs. For regular grids (1D and 2D), like images, this scheme can be efficiently implemented using einsum operations, concatenations, and padding in logarithmic time. pLSTMs tackle the vanishing/exploding activation/gradient problem for long distances in DAGs via two distinct modes: a directed propagation mode (P-mode) and a diffusive distribution mode (D-mode). To showcase the long-range capabilities of pLSTM, we introduce arrow-pointing extrapolation as a synthetic computer vision task that contains long-distance directional information. We demonstrate that pLSTMs generalize well to larger image sizes, whereas Transformers struggle to extrapolate. On established molecular graph and computer vision benchmarks, pLSTMs also show strong performance. Code and Datasets are available at: https://github.com/ml-jku/plstm_experiments.
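To make the abstract's parallelization analogy concrete, here is a minimal sketch (our illustration, not the authors' code; see the linked repository for the real implementation) of the sequential special case that pLSTM generalizes: a 1D linear recurrence evaluated with a log-depth associative scan. All function and variable names below are our own.

```python
# Hypothetical sketch: log-depth inclusive scan (Hillis-Steele style) for the
# 1D linear recurrence h_t = a_t * h_{t-1} + x_t with h_{-1} = 0.
import numpy as np

def linear_scan_parallel(a, x):
    """Pairs (a, x) form a monoid under
        (a2, x2) o (a1, x1) = (a2 * a1, a2 * x1 + x2),
    so all prefixes can be combined in O(log T) parallel steps."""
    a, x = a.copy(), x.copy()
    T = len(x)
    shift = 1
    while shift < T:
        # combine each element with the element `shift` positions earlier;
        # numpy evaluates each right-hand side fully before assigning
        x[shift:] = a[shift:] * x[:-shift] + x[shift:]
        a[shift:] = a[shift:] * a[:-shift]
        shift *= 2
    return x

# sanity check against the sequential recurrence
rng = np.random.default_rng(0)
a, x = rng.uniform(0.5, 1.0, 16), rng.normal(size=16)
h, hs = 0.0, []
for t in range(16):
    h = a[t] * h + x[t]
    hs.append(h)
assert np.allclose(linear_scan_parallel(a, x), hs)
```

Per the abstract, the paper replaces this 1D prefix structure with Source, Transition, and Mark gates acting on the line graph of a general DAG, combined chunkwise in the same associative fashion.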
Related papers
- Structured Linear CDEs: Maximally Expressive and Parallel-in-Time Sequence Models [6.389310720722303]
We provide a unifying framework for sequence models with structured, input-dependent state-transition matrices.
In contrast to the diagonal state-transition matrices of S4 and Mamba, SLiCEs employ block-diagonal, sparse, or Walsh-Hadamard matrices.
Empirically, SLiCEs solve the $A_5$ state-tracking benchmark with a single layer, achieve best-in-class length generalisation on regular language tasks among parallel-in-time models, and match the state-of-the-art performance of log neural controlled differential equations.
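For intuition only, a tiny numpy sketch of an input-dependent block-diagonal state transition; the recurrence form $h_t = A(x_t) h_{t-1} + B x_t$, the shapes, and the names are our assumptions, not the paper's code. The per-step cost drops from O(n^2) for a dense transition to O(n*b) for block size b.

```python
import numpy as np

def block_diag_step(h, x, W_blocks, B, block=4):
    """One recurrence step with A(x) block-diagonal.

    W_blocks: (n_blocks, block, block, d) tensor mapping the input
              x (shape (d,)) linearly to each block of A(x)."""
    n_blocks = W_blocks.shape[0]
    h_new = np.empty_like(h)
    for i in range(n_blocks):
        A_i = W_blocks[i] @ x            # (block, block) input-dependent block
        sl = slice(i * block, (i + 1) * block)
        h_new[sl] = A_i @ h[sl]          # each block acts on its own slice of h
    return h_new + B @ x

n, d, block = 16, 8, 4
rng = np.random.default_rng(1)
W_blocks = rng.normal(scale=0.1, size=(n // block, block, block, d))
B = rng.normal(size=(n, d))
h = np.zeros(n)
for x in rng.normal(size=(10, d)):       # run the recurrence over 10 inputs
    h = block_diag_step(h, x, W_blocks, B, block)
```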
arXiv Detail & Related papers (2025-05-23T11:34:21Z)
- Fixed-Point RNNs: From Diagonal to Dense in a Few Iterations [10.851383867834052]
We compute a dense linear RNN as the fixed point of a parallelizable diagonal linear RNN in a single layer.
We achieve state-of-the-art results on the commonly used toy tasks $A_5$, $S_5$, copying, and modular arithmetic.
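A hedged sketch of one way such a fixed-point construction can work, assuming the dense transition splits as A = D + R into diagonal and off-diagonal parts; this is our reading of the idea, not the paper's algorithm.

```python
# Iterate h_t^(k+1) = D h_{t-1}^(k+1) + R h_{t-1}^(k) + x_t: each iteration is
# a *diagonal* recurrence (parallelizable over t) and the fixed point satisfies
# the dense recurrence h_t = A h_{t-1} + x_t when A is sufficiently contractive.
import numpy as np

def dense_rnn_by_fixed_point(A, xs, iters=40):
    d = np.diag(A)                        # diagonal part D
    R = A - np.diag(d)                    # off-diagonal remainder
    h = np.zeros_like(xs)                 # initial guess h^(0) = 0
    for _ in range(iters):
        drive = xs.copy()
        drive[1:] += h[:-1] @ R.T         # R h_{t-1}^(k) feeds step t
        # diagonal recurrence (sequential here; a parallel scan works too)
        h_new = np.empty_like(xs)
        prev = np.zeros(A.shape[0])
        for t in range(len(xs)):
            prev = d * prev + drive[t]
            h_new[t] = prev
        h = h_new
    return h

rng = np.random.default_rng(2)
A = 0.1 * rng.normal(size=(4, 4))         # contractive dense transition
xs = rng.normal(size=(12, 4))
ref, prev = [], np.zeros(4)               # reference: sequential dense RNN
for x in xs:
    prev = A @ prev + x
    ref.append(prev)
assert np.allclose(dense_rnn_by_fixed_point(A, xs), ref)
```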
arXiv Detail & Related papers (2025-03-13T18:50:22Z)
- GL-Fusion: Rethinking the Combination of Graph Neural Network and Large Language model [63.774726052837266]
We introduce a new architecture that deeply integrates Graph Neural Networks (GNNs) with Large Language Models (LLMs).
We introduce three key innovations: (1) Structure-Aware Transformers, which incorporate GNNs' message-passing capabilities directly into the LLM's transformer layers; (2) Graph-Text Cross-Attention, which processes full, uncompressed text from graph nodes and edges; and (3) a GNN-LLM Twin Predictor, enabling the LLM's flexible autoregressive generation alongside the GNN's scalable one-pass prediction.
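As a rough illustration of the cross-attention ingredient (a generic sketch under our own assumed shapes, not GL-Fusion's API), text-token queries attend over graph-node keys and values:

```python
import numpy as np

def graph_text_cross_attention(text, nodes, Wq, Wk, Wv):
    Q = text @ Wq                                # (T, dk) queries from text tokens
    K = nodes @ Wk                               # (N, dk) keys from graph nodes
    V = nodes @ Wv                               # (N, dv) values from graph nodes
    scores = Q @ K.T / np.sqrt(Wq.shape[1])      # (T, N) token-node affinities
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)            # softmax over nodes
    return w @ V                                 # (T, dv) graph-conditioned states

rng = np.random.default_rng(5)
text, nodes = rng.normal(size=(6, 32)), rng.normal(size=(10, 32))
Wq, Wk, Wv = (rng.normal(size=(32, 16)) for _ in range(3))
out = graph_text_cross_attention(text, nodes, Wq, Wk, Wv)  # (6, 16)
```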
arXiv Detail & Related papers (2024-12-08T05:49:58Z)
- xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism [5.704297874096985]
Diffusion models are pivotal for generating high-quality images and videos.
This paper introduces xDiT, a comprehensive parallel inference engine for DiTs.
Notably, we are the first to demonstrate DiTs' scalability on Ethernet-connected GPU clusters.
arXiv Detail & Related papers (2024-11-04T01:40:38Z)
- Simple Multigraph Convolution Networks [49.19906483875984]
Existing multigraph convolution methods either ignore the cross-view interaction among multiple graphs, or induce extremely high computational cost due to standard cross-view operators.
This paper proposes Simple Multigraph Convolution Networks (SMGCN), which first extracts consistent cross-view topology from multigraphs, including edge-level and subgraph-level topology, and then performs expansion based on the raw multigraphs and the consistent topologies.
In theory, SMGCN utilizes the consistent topologies in expansion rather than standard cross-view expansion, which yields credible cross-view spatial message-passing and effectively reduces the complexity of standard expansion.
arXiv Detail & Related papers (2024-03-08T03:27:58Z)
- ARNN: Attentive Recurrent Neural Network for Multi-channel EEG Signals to Identify Epileptic Seizures [2.3907933297014927]
An Attention Recurrent Neural Network (ARNN) is proposed that can process a large amount of data efficiently and accurately.
ARNN cell recurrently applies attention layers along a sequence and has linear complexity with the sequence length.
This framework is inspired in part by the attention layer and long short-term memory (LSTM) cells, but it scales this typical cell up by several orders of magnitude and parallelizes it across multi-channel EEG signals.
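A minimal sketch of this recurrent-attention pattern (our illustration, not the ARNN code): full attention is applied inside each chunk together with a carried state, so total cost grows linearly with sequence length.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def recurrent_attention(x, chunk=16):
    """x: (T, d). Self-attention inside each chunk, prepended with the
    previous chunk's summary state; returns (T, d) outputs."""
    d = x.shape[1]
    state = np.zeros((1, d))
    outs = []
    for start in range(0, len(x), chunk):
        blk = np.concatenate([state, x[start:start + chunk]])  # (1+c, d)
        att = softmax(blk @ blk.T / np.sqrt(d)) @ blk          # plain attention
        state = att[:1]                 # first row becomes the carried state
        outs.append(att[1:])
    return np.concatenate(outs)

eeg = np.random.default_rng(6).normal(size=(128, 19))  # e.g. 19 EEG channels
y = recurrent_attention(eeg)                           # (128, 19), linear in T
```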
arXiv Detail & Related papers (2024-03-05T19:15:17Z)
- Gated Linear Attention Transformers with Hardware-Efficient Training [60.670102007737476]
This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability.
We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates.
When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention Transformer is found to perform competitively.
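The underlying chunkwise trick can be sketched compactly (gates omitted and names our own; the paper's contribution is a hardware-efficient, gated version of this pattern):

```python
# Causal linear attention O_t = q_t^T * sum_{s<=t} k_s v_s^T, computed chunk by
# chunk: each chunk takes one matmul against a running state plus a small
# intra-chunk attention, instead of a full T x T attention matrix.
import numpy as np

def chunkwise_linear_attention(Q, K, V, chunk=16):
    T, dk = Q.shape
    S = np.zeros((dk, V.shape[1]))              # running state sum k_s v_s^T
    out = np.empty((T, V.shape[1]))
    for i in range(0, T, chunk):
        q, k, v = Q[i:i+chunk], K[i:i+chunk], V[i:i+chunk]
        inter = q @ S                           # contribution of past chunks
        mask = np.tril(np.ones((len(q), len(q))))
        intra = (q @ k.T * mask) @ v            # causal part inside the chunk
        out[i:i+chunk] = inter + intra
        S += k.T @ v                            # fold this chunk into the state
    return out

# check against the quadratic causal form
rng = np.random.default_rng(3)
Q, K, V = rng.normal(size=(3, 64, 8))
ref = (np.tril(np.ones((64, 64))) * (Q @ K.T)) @ V
assert np.allclose(chunkwise_linear_attention(Q, K, V), ref)
```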
arXiv Detail & Related papers (2023-12-11T18:51:59Z)
- Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z)
- Block-Recurrent Transformers [49.07682696216708]
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence.
Our recurrent cell operates on blocks of tokens rather than single tokens, and leverages parallel computation within a block in order to make efficient use of accelerator hardware.
arXiv Detail & Related papers (2022-03-11T23:44:33Z)
- Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers, and achieved comparable performance to pure data-driven networks but using far fewer parameters.
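Algorithm unfolding in general can be illustrated with the classic LISTA construction (a generic example, not this paper's GDPA-specific network): each proximal-gradient iteration becomes a layer whose matrices and thresholds could be trained.

```python
# K-layer unrolled ISTA for sparse coding: x^{k+1} = soft(W y + S x^k, theta_k).
# In classical ISTA, W = (1/L) A^T and S = I - (1/L) A^T A are fixed by the
# dictionary A; unfolding makes W, S, and the thresholds learnable per layer.
import numpy as np

def soft_threshold(z, theta):
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)

def unfolded_ista(y, W, S, thetas):
    x = soft_threshold(W @ y, thetas[0])        # layer 1
    for theta in thetas[1:]:                    # layers 2..K
        x = soft_threshold(W @ y + S @ x, theta)
    return x

rng = np.random.default_rng(4)
A = rng.normal(size=(20, 50))                   # measurement / dictionary matrix
L = np.linalg.norm(A, 2) ** 2                   # Lipschitz constant of the gradient
W, S = A.T / L, np.eye(50) - A.T @ A / L        # classical ISTA initialization
x_true = (rng.random(50) < 0.1).astype(float)   # sparse ground truth
x_hat = unfolded_ista(A @ x_true, W, S, thetas=[0.05] * 8)
```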
arXiv Detail & Related papers (2021-09-10T07:01:15Z)
- VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose VersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator.
VersaGNN achieves on average a 3712$\times$ speedup with 1301.25$\times$ energy reduction over CPU, and a 35.4$\times$ speedup with 17.66$\times$ energy reduction over GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z)
- A Linear Algebraic Approach to Model Parallelism in Deep Learning [0.0]
Training deep neural networks (DNNs) in large-cluster computing environments is increasingly necessary, as networks grow in size and complexity.
We propose a linear-algebraic approach to model parallelism in deep learning, which allows parallel distribution of any tensor in the DNN.
We build distributed DNN layers using these parallel primitives, composed with sequential layer implementations, and demonstrate their application by building and training a distributed DNN using DistDL, a PyTorch and MPI-based distributed deep learning toolkit.
arXiv Detail & Related papers (2020-06-04T19:38:05Z)
- Locality Sensitive Hashing-based Sequence Alignment Using Deep Bidirectional LSTM Models [0.0]
Bidirectional Long Short-Term Memory (LSTM) is a special kind of Recurrent Neural Network (RNN) architecture.
This paper proposes to use deep bidirectional LSTM for sequence modeling as an approach to perform locality-sensitive hashing (LSH)-based sequence alignment.
arXiv Detail & Related papers (2020-04-05T05:13:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.