ScaleFormer: Span Representation Cumulation for Long-Context Transformer
- URL: http://arxiv.org/abs/2511.10029v1
- Date: Fri, 14 Nov 2025 01:27:03 GMT
- Title: ScaleFormer: Span Representation Cumulation for Long-Context Transformer
- Authors: Jiangshu Du, Wenpeng Yin, Philip Yu
- Abstract summary: We propose a plug-and-play framework that adapts off-the-shelf pre-trained encoder-decoder models to process long sequences. Our approach segments long inputs into overlapping chunks and generates a compressed, context-aware representation for the decoder. Experiments on long-document summarization show that our method is highly competitive with and often outperforms state-of-the-art approaches.
- Score: 9.845891949404534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The quadratic complexity of standard self-attention severely limits the application of Transformer-based models to long-context tasks. While efficient Transformer variants exist, they often require architectural changes and costly pre-training from scratch. To circumvent this, we propose ScaleFormer(Span Representation Cumulation for Long-Context Transformer) - a simple and effective plug-and-play framework that adapts off-the-shelf pre-trained encoder-decoder models to process long sequences without requiring architectural modifications. Our approach segments long inputs into overlapping chunks and generates a compressed, context-aware representation for the decoder. The core of our method is a novel, parameter-free fusion mechanism that endows each chunk's representation with structural awareness of its position within the document. It achieves this by enriching each chunk's boundary representations with cumulative context vectors from all preceding and succeeding chunks. This strategy provides the model with a strong signal of the document's narrative flow, achieves linear complexity, and enables pre-trained models to reason effectively over long-form text. Experiments on long-document summarization show that our method is highly competitive with and often outperforms state-of-the-art approaches without requiring architectural modifications or external retrieval mechanisms.
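The cumulative boundary-fusion idea described in the abstract can be sketched as follows. This is a minimal NumPy illustration under assumptions not stated in the abstract (mean-pooling each chunk into a context vector, averaged additive fusion onto the boundary positions); the paper's exact operators may differ.

```python
import numpy as np

def cumulative_span_fusion(chunk_reprs):
    """Parameter-free fusion sketch: enrich each chunk's boundary
    vectors with cumulative context from preceding and succeeding chunks.

    chunk_reprs: list of (chunk_len, d) arrays, one per overlapping chunk.
    Returns fused representations with the same shapes.
    """
    # Mean-pool each chunk into one context vector (an assumption here;
    # the paper does not specify the pooling in the abstract).
    pooled = np.stack([c.mean(axis=0) for c in chunk_reprs])      # (n, d)

    # Cumulative context over all preceding / succeeding chunks.
    fwd = np.cumsum(pooled, axis=0)               # fwd[i] = sum of chunks 0..i
    bwd = np.cumsum(pooled[::-1], axis=0)[::-1]   # bwd[i] = sum of chunks i..n-1

    fused = []
    for i, c in enumerate(chunk_reprs):
        out = c.copy()
        if i > 0:
            # Preceding context, averaged, added to the left boundary token.
            out[0] = out[0] + fwd[i - 1] / i
        if i < len(chunk_reprs) - 1:
            # Succeeding context, averaged, added to the right boundary token.
            n_succ = len(chunk_reprs) - 1 - i
            out[-1] = out[-1] + bwd[i + 1] / n_succ
        fused.append(out)
    return fused
```

Because the fusion is a pair of cumulative sums plus two vector additions per chunk, it adds no parameters and keeps the overall cost linear in the number of chunks, consistent with the linear-complexity claim above.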
Related papers
- Stacked from One: Multi-Scale Self-Injection for Context Window Extension [69.24689919827817]
modelname is a novel framework based on multi-grained context compression and query-aware information acquisition. modelname achieves performance superior or comparable to strong baselines.
arXiv Detail & Related papers (2026-03-05T03:16:16Z) - Conv-like Scale-Fusion Time Series Transformer: A Multi-Scale Representation for Variable-Length Long Time Series [10.93942806756288]
Transformer-based models have advanced time series tasks, but struggle with feature redundancy and limited generalization capabilities. We propose a Multi-Scale Representation Learning Framework based on a Conv-like ScaleFusion Transformer. Our framework achieves superior feature independence, reduced redundancy, and better performance in forecasting and classification tasks compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-09-22T14:37:59Z) - LoViC: Efficient Long Video Generation with Context Compression [68.22069741704158]
We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations.
arXiv Detail & Related papers (2025-07-17T09:46:43Z) - LOCOST: State-Space Models for Long Document Abstractive Summarization [76.31514220737272]
We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs.
With a computational complexity of $O(L log L)$, this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns.
arXiv Detail & Related papers (2024-01-31T15:33:37Z) - Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided
Dynamic Token Merge for Document Understanding [40.322453628755376]
General-purpose efficient Transformers are difficult to adapt directly to document modeling.
Fast-StrucTexT is an efficient multi-modal framework based on the StrucTexT algorithm with an hourglass transformer architecture.
Our model achieves state-of-the-art performance with almost 1.9X faster inference than prior state-of-the-art methods.
arXiv Detail & Related papers (2023-05-19T02:42:35Z) - ChunkFormer: Learning Long Time Series with Multi-stage Chunked
Transformer [0.0]
Original Transformer-based models adopt an attention mechanism to discover global information along a sequence.
ChunkFormer splits the long sequences into smaller sequence chunks for the attention calculation.
In this way, the proposed model gradually learns both local and global information without changing the total length of the input sequences.
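The chunked attention idea can be sketched as follows: attention is computed independently within fixed-size chunks, so cost grows linearly in sequence length rather than quadratically. This is an illustrative single-stage sketch only; ChunkFormer's multi-stage hierarchy and training details are omitted.

```python
import numpy as np

def chunked_self_attention(x, chunk_size):
    """Self-attention restricted to fixed-size chunks.

    x: (seq_len, d) array; seq_len is assumed divisible by chunk_size.
    Cost is O(seq_len * chunk_size * d) versus O(seq_len^2 * d)
    for full attention.
    """
    seq_len, d = x.shape
    out = np.empty_like(x)
    for start in range(0, seq_len, chunk_size):
        chunk = x[start:start + chunk_size]               # (c, d)
        scores = chunk @ chunk.T / np.sqrt(d)             # (c, c)
        # Row-wise softmax (numerically stabilized).
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[start:start + chunk_size] = weights @ chunk
    return out
```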
arXiv Detail & Related papers (2021-12-30T15:06:32Z) - Beyond Self Attention: A Subquadratic Fourier Wavelet Transformer with Multi Modal Fusion [0.0]
We revisit the use of spectral techniques to replace the attention mechanism in Transformers. We present a comprehensive and novel reformulation of this technique in next-generation transformer models.
arXiv Detail & Related papers (2021-11-25T18:03:41Z) - HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text
Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in ROUGE F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z) - Long-Span Dependencies in Transformer-based Summarization Systems [38.672160430296536]
Transformer-based models have achieved state-of-the-art results in a wide range of natural language processing (NLP) tasks including document summarization.
One issue with these transformer-based models is that they do not scale well in terms of memory and compute requirements as the input length grows.
In this work, we exploit large pre-trained transformer-based models and address long-span dependencies in abstractive summarization.
arXiv Detail & Related papers (2021-05-08T23:53:03Z) - Cluster-Former: Clustering-based Sparse Transformer for Long-Range
Dependency Encoding [90.77031668988661]
Cluster-Former is a novel clustering-based sparse Transformer that performs attention across chunked sequences.
The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer.
Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
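The clustering-based sparse attention idea can be sketched as follows: positions are grouped by similarity and attention runs only within each group. This is an illustrative sketch with a single assignment step standing in for full clustering; Cluster-Former's actual Sliding-Window and Cluster-Former layers differ in detail.

```python
import numpy as np

def clustered_attention(x, n_clusters, seed=0):
    """Attend only within clusters of similar positions.

    x: (seq_len, d) array. One k-means-style assignment step is used
    here for brevity in place of a full clustering loop.
    """
    rng = np.random.default_rng(seed)
    seq_len, d = x.shape
    # Initialize centroids from sequence positions, then assign each
    # position to its nearest centroid.
    centroids = x[rng.choice(seq_len, n_clusters, replace=False)]
    assign = np.argmin(
        ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)

    out = np.empty_like(x)
    for k in range(n_clusters):
        idx = np.where(assign == k)[0]
        if idx.size == 0:
            continue
        chunk = x[idx]
        scores = chunk @ chunk.T / np.sqrt(d)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)                      # softmax rows
        out[idx] = w @ chunk
    return out
```

Unlike fixed chunking, cluster membership lets distant but semantically similar positions attend to each other, which is the motivation for the clustering step.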
arXiv Detail & Related papers (2020-09-13T22:09:30Z) - Cascaded Text Generation with Markov Transformers [122.76100449018061]
Two dominant approaches to neural text generation are fully autoregressive models, using serial beam search decoding, and non-autoregressive models, using parallel decoding with no output dependencies.
This work proposes an autoregressive model with sub-linear parallel time generation. Noting that conditional random fields with bounded context can be decoded in parallel, we propose an efficient cascaded decoding approach for generating high-quality output.
This approach requires only a small modification from standard autoregressive training, while showing competitive accuracy/speed tradeoff compared to existing methods on five machine translation datasets.
arXiv Detail & Related papers (2020-06-01T17:52:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.