RRWKV: Capturing Long-range Dependencies in RWKV
- URL: http://arxiv.org/abs/2306.05176v4
- Date: Fri, 13 Sep 2024 08:58:47 GMT
- Title: RRWKV: Capturing Long-range Dependencies in RWKV
- Authors: Leilei Wang
- Abstract summary: The paper devises the Retrospected Receptance Weighted Key Value (RRWKV) architecture by incorporating a retrospecting ability into RWKV so that it can effectively absorb past information.
RWKV exploits a linear tensor-product attention mechanism and achieves parallelized computation through its time-sequential mode.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Owing to its impressive dot-product attention, the Transformer has been the dominant architecture in various natural language processing (NLP) tasks. Recently, the Receptance Weighted Key Value (RWKV) architecture has followed a non-transformer design to eliminate the drawbacks of dot-product attention, whose memory and computational complexity scale quadratically with sequence length. Although RWKV exploits a linear tensor-product attention mechanism and achieves parallelized computation through its time-sequential mode, it struggles to capture long-range dependencies because of its limited ability to look back at earlier tokens, in contrast to the full pairwise interactions available in the standard Transformer. Therefore, this paper devises the Retrospected Receptance Weighted Key Value (RRWKV) architecture, which incorporates a retrospecting ability into RWKV to absorb information effectively while maintaining memory and computational efficiency.
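To make the contrast concrete, here is a minimal sketch of the linear, time-sequential attention used by RWKV-style models. This is not the authors' code; the shapes, the exponential-decay parameterization, and the lack of numerical stabilization are simplifying assumptions. Each step folds the past into a fixed-size state, so the cost grows linearly with sequence length, but a token sees earlier context only through that state, which is exactly the look-back limitation that RRWKV's retrospection targets.

```python
import numpy as np

def wkv_sequential(k, v, w, u):
    """Sketch of an RWKV-style WKV recurrence (simplified, no numerical
    stabilization). k, v: (T, D) key/value sequences; w: (D,) per-channel
    decay rate; u: (D,) bonus weight for the current token.

    The state is O(D) regardless of T, so the scan is linear in sequence
    length, unlike the O(T^2) score matrix of dot-product attention."""
    T, D = k.shape
    num = np.zeros(D)   # running weighted sum of values
    den = np.zeros(D)   # running sum of weights
    out = np.zeros((T, D))
    for t in range(T):
        e_k = np.exp(k[t])
        # The current token receives an extra bonus weight exp(u + k_t).
        out[t] = (num + np.exp(u) * e_k * v[t]) / (den + np.exp(u) * e_k)
        # Fold token t into the state, decaying older content by exp(-w).
        num = np.exp(-w) * num + e_k * v[t]
        den = np.exp(-w) * den + e_k
    return out

T, D = 16, 8
rng = np.random.default_rng(0)
y = wkv_sequential(rng.normal(size=(T, D)), rng.normal(size=(T, D)),
                   w=np.ones(D) * 0.5, u=np.zeros(D))
print(y.shape)  # (16, 8)
```

By contrast, standard dot-product attention materializes a T-by-T score matrix, which is what the quadratic memory and compute cost in the abstract refers to.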
Related papers
- Sentinel: Multi-Patch Transformer with Temporal and Channel Attention for Time Series Forecasting [48.52101281458809]
Transformer-based time series forecasting has recently gained strong interest due to the ability of transformers to model sequential data.
We propose Sentinel, a transformer-based architecture composed of an encoder that extracts contextual information from the channel dimension.
We introduce a multi-patch attention mechanism, which leverages the patching process to structure the input sequence in a way that integrates naturally into the transformer architecture (a minimal sketch of the patching idea follows below).
arXiv Detail & Related papers (2025-03-22T06:01:50Z)
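As a hedged illustration of the patching idea in the Sentinel entry above (the patch length and stride are illustrative choices, not the paper's settings), a series of length T can be cut into patches that serve as the transformer's input tokens:

```python
import numpy as np

def patchify(series, patch_len=16, stride=8):
    """Cut a 1-D series (T,) into overlapping patches (num_patches, patch_len).
    Each patch is later embedded and treated as one transformer token, so
    attention runs over roughly T/stride tokens instead of T time steps."""
    T = series.shape[0]
    starts = range(0, T - patch_len + 1, stride)
    return np.stack([series[s:s + patch_len] for s in starts])

x = np.sin(np.linspace(0, 10, 128))   # toy series, T = 128
patches = patchify(x)                 # (15, 16) token matrix
print(patches.shape)
```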
- Tensor Product Attention Is All You Need [54.40495407154611]
Tensor Product Attention (TPA) is a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly.
TPA achieves improved model quality alongside memory efficiency.
We introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling (a sketch of the low-rank idea follows below).
arXiv Detail & Related papers (2025-01-11T03:37:10Z)
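The compactness claim in the TPA entry above can be illustrated with a sketch of a rank-r tensor-product representation of the keys; the rank, shapes, and caching arithmetic here are illustrative assumptions, not the paper's exact factorization.

```python
import numpy as np

H, d, r = 8, 64, 2          # heads, head dim, rank (illustrative values)

def reconstruct_heads(A, B):
    """Rebuild per-head keys from rank-r factors.
    A: (T, r, H) head-side factors; B: (T, r, d) feature-side factors.
    Returns K: (T, H, d) with K[t] = (1/r) * sum_i outer(A[t, i], B[t, i])."""
    return np.einsum('trh,trd->thd', A, B) / r

T = 10
rng = np.random.default_rng(1)
A, B = rng.normal(size=(T, r, H)), rng.normal(size=(T, r, d))
K = reconstruct_heads(A, B)           # (10, 8, 64)
full = T * H * d                      # floats cached for full keys
factored = T * r * (H + d)            # floats cached for the factors
print(K.shape, full, factored)        # cache shrinks when r*(H+d) < H*d
```

Caching the thin factors instead of the full key tensor is the kind of memory saving the entry's "compact representation" refers to.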
- A Survey of RWKV [16.618320854505786]
The Receptance Weighted Key Value (RWKV) model offers a novel alternative to the Transformer architecture.
Unlike conventional Transformers, which depend heavily on self-attention, RWKV adeptly captures long-range dependencies with minimal computational demands.
This paper fills a gap in the literature as the first comprehensive review of the RWKV architecture, its core principles, and its varied applications.
arXiv Detail & Related papers (2024-12-19T13:39:24Z) - Enhanced Computationally Efficient Long LoRA Inspired Perceiver Architectures for Auto-Regressive Language Modeling [2.9228447484533695]
The Transformer architecture has revolutionized the Natural Language Processing field and is the backbone of Large Language Models (LLMs).
One of the challenges in the Transformer architecture is the quadratic complexity of the attention mechanism, which prohibits the efficient processing of long sequence lengths.
An important line of work in this respect is the Perceiver class of architectures, which has demonstrated excellent performance while reducing computational complexity (a minimal sketch of the latent cross-attention idea follows below).
arXiv Detail & Related papers (2024-12-08T23:41:38Z)
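Here is a minimal sketch of the latent-bottleneck idea behind Perceiver-style architectures referenced above; the single-head form and the absence of separate Q/K/V projections are simplifications of the real design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def latent_cross_attention(latents, inputs):
    """Single-head cross-attention: M latent queries attend over N inputs.
    latents: (M, D), inputs: (N, D). The score matrix is (M, N), so the
    cost is O(M*N) instead of the O(N^2) of self-attention over inputs."""
    D = inputs.shape[1]
    scores = latents @ inputs.T / np.sqrt(D)   # (M, N)
    return softmax(scores) @ inputs            # (M, D) compressed summary

rng = np.random.default_rng(2)
N, M, D = 4096, 64, 32                         # long input, small bottleneck
z = latent_cross_attention(rng.normal(size=(M, D)), rng.normal(size=(N, D)))
print(z.shape)                                 # (64, 32)
```

Because M is fixed and small, subsequent self-attention among the latents is cheap regardless of the input length N.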
- RecurFormer: Not All Transformer Heads Need Self-Attention [14.331807060659902]
Transformer-based large language models (LLMs) excel in modeling complex language patterns but face significant computational costs during inference.
We propose RecurFormer, a novel architecture that replaces certain attention heads with linear recurrent neural networks.
arXiv Detail & Related papers (2024-10-10T15:24:12Z)
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision (a sketch follows below).
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
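As a hedged sketch of the attention-as-feature-map idea above (the kernel, the per-head 2-D convolution, and the placement before softmax are assumptions rather than DAPE V2's exact design), the (heads, T, T) score tensor can be filtered like a multi-channel image:

```python
import numpy as np
from scipy.signal import convolve2d

def conv_on_attention(scores, kernel):
    """scores: (H, T, T) pre-softmax attention logits, one 'image' per head;
    kernel: (k, k) filter. Filters each head's score map the way computer
    vision filters a feature map, so neighboring positions can reshape
    each attention logit before the softmax."""
    return np.stack([convolve2d(s, kernel, mode='same') for s in scores])

rng = np.random.default_rng(3)
H, T = 4, 32
scores = rng.normal(size=(H, T, T))
kernel = np.full((3, 3), 1.0 / 9.0)   # simple smoothing filter (assumption)
refined = conv_on_attention(scores, kernel)
print(refined.shape)                  # (4, 32, 32); a causal model would
                                      # also re-apply masking here (assumption)
```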
- PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
The self-attention mechanism in the Transformer architecture requires positional embeddings to encode temporal order in time series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
arXiv Detail & Related papers (2024-08-20T01:56:07Z)
- An All-MLP Sequence Modeling Architecture That Excels at Copying [6.824179106436217]
We present an all-MLP sequence modeling architecture that can match Transformers on the copying task.
In an ablation study, we found that both exponential activation and pre-activation normalization are indispensable for Transformer-level copying.
arXiv Detail & Related papers (2024-06-23T17:19:26Z)
- Folded Context Condensation in Path Integral Formalism for Infinite Context Transformers [0.0]
We present a generalized formulation of the Transformer algorithm by reinterpreting its core mechanisms within the framework of Path Integral formalism.
We obtain a more compact and efficient representation, in which the contextual information of a sequence is condensed into memory-like segments.
We validate the effectiveness of this approach through the Passkey retrieval task and a summarization task, demonstrating that the proposed method preserves historical information while exhibiting memory usage that scales linearly with sequence length.
arXiv Detail & Related papers (2024-05-07T19:05:26Z)
- Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures [99.20299078655376]
This paper introduces Vision-RWKV, a model adapted from the RWKV architecture used in the NLP field.
Our model is designed to efficiently handle sparse inputs and demonstrate robust global processing capabilities.
Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage.
arXiv Detail & Related papers (2024-03-04T18:46:20Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- Resource-Efficient Separation Transformer [14.666016177212837]
This paper explores Transformer-based speech separation with a reduced computational cost.
Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture.
The RE-SepFormer reaches a competitive performance on the popular WSJ0-2Mix and WHAM! datasets in both causal and non-causal settings.
arXiv Detail & Related papers (2022-06-19T23:37:24Z)
- ReconFormer: Accelerated MRI Reconstruction Using Recurrent Transformer [60.27951773998535]
We propose a recurrent transformer model, namely ReconFormer, for MRI reconstruction.
It can iteratively reconstruct high-fidelity magnetic resonance images from highly under-sampled k-space data.
We show that it achieves significant improvements over the state-of-the-art methods with better parameter efficiency.
arXiv Detail & Related papers (2022-01-23T21:58:19Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.