Efficient Time Series Processing for Transformers and State-Space Models through Token Merging
- URL: http://arxiv.org/abs/2405.17951v1
- Date: Tue, 28 May 2024 08:28:18 GMT
- Title: Efficient Time Series Processing for Transformers and State-Space Models through Token Merging
- Authors: Leon Götz, Marcel Kollovieh, Stephan Günnemann, Leo Schwinn
- Abstract summary: Token merging has been shown to considerably improve the throughput of vision transformer architectures.
We introduce local merging, a domain-specific token merging algorithm that selectively combines tokens within a local neighborhood.
On the recently proposed Chronos foundation model, we achieve accelerations up to 5400% with only minor accuracy degradations.
- Score: 44.27818172708914
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer architectures have shown promising results in time series processing. However, despite recent advances in subquadratic attention mechanisms or state-space models, processing very long sequences still imposes significant computational requirements. Token merging, which involves replacing multiple tokens with a single one calculated as their linear combination, has been shown to considerably improve the throughput of vision transformer architectures while maintaining accuracy. In this work, we go beyond computer vision and perform the first investigations of token merging in time series analysis on both time series transformers and state-space models. To effectively scale token merging to long sequences, we introduce local merging, a domain-specific token merging algorithm that selectively combines tokens within a local neighborhood, adjusting the computational complexity from linear to quadratic based on the neighborhood size. Our comprehensive empirical evaluation demonstrates that token merging offers substantial computational benefits with minimal impact on accuracy across various models and datasets. On the recently proposed Chronos foundation model, we achieve accelerations up to 5400% with only minor accuracy degradations.
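To make the merging operation concrete, the following is a minimal, hypothetical sketch of local merging in PyTorch: within each fixed-size local window, the most similar pair of adjacent tokens is replaced by its mean (a linear combination of the two tokens). The window size, similarity measure, and merging schedule here are assumptions for illustration; the paper's actual algorithm may differ.

```python
import torch
import torch.nn.functional as F


def local_merge(x: torch.Tensor, window: int = 4) -> torch.Tensor:
    """Merge the most similar adjacent token pair inside each local window.

    x: (batch, seq_len, dim) token sequence.
    Returns a sequence shortened by one token per window of `window` tokens.
    """
    b, n, d = x.shape
    n = (n // window) * window                    # drop a ragged tail for simplicity
    blocks = x[:, :n].reshape(b, -1, window, d)   # (b, n_windows, window, d)

    # Cosine similarity between adjacent tokens inside each window.
    normed = F.normalize(blocks, dim=-1)
    sim = (normed[:, :, :-1] * normed[:, :, 1:]).sum(-1)  # (b, n_windows, window-1)
    best = sim.argmax(dim=-1)                              # most similar adjacent pair per window

    merged = []
    for w in range(blocks.shape[1]):
        rows = []
        for bi in range(b):
            i = best[bi, w].item()
            blk = blocks[bi, w]
            pair_mean = 0.5 * (blk[i] + blk[i + 1])        # linear combination of two tokens
            rows.append(torch.cat([blk[:i], pair_mean[None], blk[i + 2:]], dim=0))
        merged.append(torch.stack(rows))                   # (b, window-1, d)
    return torch.cat(merged, dim=1)                        # (b, n_windows*(window-1), d)
```

Applying such a step before or between model blocks shortens the token sequence and thereby reduces the downstream attention or state-space cost.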
Related papers
- Rough Transformers: Lightweight Continuous-Time Sequence Modelling with Path Signatures [46.58170057001437]
We introduce the Rough Transformer, a variation of the Transformer model that operates on continuous-time representations of input sequences.
We find that, on a variety of time-series-related tasks, Rough Transformers consistently outperform their vanilla attention counterparts.
arXiv Detail & Related papers (2024-05-31T14:00:44Z) - Leveraging 2D Information for Long-term Time Series Forecasting with Vanilla Transformers [55.475142494272724]
Time series prediction is crucial for understanding and forecasting complex dynamics in various domains.
We introduce GridTST, a model that combines the benefits of two approaches using innovative multi-directional attentions.
The model consistently delivers state-of-the-art performance across various real-world datasets.
arXiv Detail & Related papers (2024-05-22T16:41:21Z) - Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
Their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z) - Sparse Binary Transformers for Multivariate Time Series Modeling [1.3965477771846404]
We show that lightweight Compressed Neural Networks can achieve accuracy comparable to dense floating-point Transformers.
Our model achieves favorable results across three time series learning tasks: classification, anomaly detection, and single-step forecasting.
We measure the computational savings of our approach over a range of metrics including parameter count, bit size, and floating point operation (FLOPs) count.
arXiv Detail & Related papers (2023-08-09T00:23:04Z) - Toeplitz Neural Network for Sequence Modeling [46.04964190407727]
We show that a Toeplitz matrix-vector product trick can reduce the space-time complexity of sequence modeling to log-linear (a sketch of this classical trick follows the list below).
A lightweight sub-network called relative position encoder is proposed to generate relative position coefficients with a fixed budget of parameters.
Despite being trained on 512-token sequences, our model can extrapolate to input sequences of up to 14K tokens at inference with consistent performance.
arXiv Detail & Related papers (2023-05-08T14:49:01Z) - Robust representations of oil wells' intervals via sparse attention mechanism [2.604557228169423]
We introduce the class of efficient Transformers named Regularized Transformers (Reguformers).
The focus of our experiments is on oil & gas data, namely well logs.
To evaluate our models for such problems, we work with an industry-scale open dataset consisting of well logs of more than 20 wells.
arXiv Detail & Related papers (2022-12-29T09:56:33Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count (see the sketch after this list).
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We employ transformers in this work and incorporate them into a hierarchical framework for shape classification as well as part and scene segmentation.
We also compute efficient, dynamic global cross-attention by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art mean accuracy in shape classification and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z) - Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences [52.6022911513076]
Transformer-based models are not efficient in processing long sequences due to the quadratic space and time complexity of the self-attention modules.
Previous works such as Linformer and Informer reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection, respectively (a sketch of the projection idea follows this list).
Based on the theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve the accuracy of matrix approximation to self-attention.
arXiv Detail & Related papers (2021-12-10T06:58:05Z)
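The Toeplitz Neural Network entry above relies on a Toeplitz matrix-vector product trick. The classical version of that trick embeds the Toeplitz matrix in a circulant matrix and multiplies via the FFT in O(n log n) time; the sketch below illustrates this general technique, not the paper's specific architecture.

```python
import torch


def toeplitz_matvec(first_col: torch.Tensor, first_row: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Compute T @ x in O(n log n), where T is the n x n Toeplitz matrix with
    the given first column and first row (first_col[0] == first_row[0])."""
    n = x.shape[0]
    # Embed T into a 2n x 2n circulant matrix whose first column is
    # [first column of T, 0, reversed tail of T's first row].
    c = torch.cat([first_col, torch.zeros(1), first_row[1:].flip(0)])
    v = torch.cat([x, torch.zeros(n)])
    # A circulant matvec is a circular convolution, i.e. a pointwise product in Fourier space.
    y = torch.fft.ifft(torch.fft.fft(c) * torch.fft.fft(v))
    return y[:n].real


# Quick check against the dense product.
n = 6
col, row = torch.randn(n), torch.randn(n)
row[0] = col[0]
T = torch.stack([torch.cat([col[:i + 1].flip(0), row[1:n - i]]) for i in range(n)])
assert torch.allclose(T @ torch.ones(n), toeplitz_matvec(col, row, torch.ones(n)), atol=1e-5)
```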
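The ClusTR entry describes clustering and then aggregating key and value tokens so that queries attend to a much shorter sequence. Below is a minimal, hypothetical sketch of that idea using a few k-means steps on the keys; the function name and the choice of plain k-means are assumptions for illustration, not ClusTR's exact procedure.

```python
import torch
import torch.nn.functional as F


def clustered_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                        num_clusters: int, kmeans_iters: int = 5) -> torch.Tensor:
    """Attend to cluster centroids of keys/values instead of all tokens.

    q, k, v: (seq_len, dim). Cost drops from O(seq_len^2) to O(seq_len * num_clusters).
    """
    n, d = k.shape
    centroids = k[torch.randperm(n)[:num_clusters]]            # init centroids from random keys
    for _ in range(kmeans_iters):
        assign = torch.cdist(k, centroids).argmin(dim=-1)       # nearest centroid per key
        onehot = F.one_hot(assign, num_clusters).float()        # (n, num_clusters)
        counts = onehot.sum(dim=0).clamp(min=1).unsqueeze(-1)   # cluster sizes
        centroids = (onehot.t() @ k) / counts                   # recompute cluster means

    k_c = centroids                                             # clustered keys
    v_c = (onehot.t() @ v) / counts                             # values aggregated per cluster
    attn = torch.softmax(q @ k_c.t() / d ** 0.5, dim=-1)        # (seq_len, num_clusters)
    return attn @ v_c                                           # (seq_len, dim)
```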
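The sketching entry contrasts low-dimensional projection (Linformer) with row selection (Informer). A minimal sketch of the projection variant is shown below, with a random sketching matrix standing in for Linformer's learned projection.

```python
import torch


def projected_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                        proj_dim: int) -> torch.Tensor:
    """Linformer-style attention: project keys and values along the sequence
    axis down to proj_dim rows, so attention costs O(seq_len * proj_dim)
    instead of O(seq_len^2). q, k, v: (seq_len, dim)."""
    n, d = k.shape
    # Random sketching matrix; Linformer learns this projection instead.
    E = torch.randn(proj_dim, n) / proj_dim ** 0.5
    k_p, v_p = E @ k, E @ v                                  # (proj_dim, dim)
    attn = torch.softmax(q @ k_p.t() / d ** 0.5, dim=-1)     # (seq_len, proj_dim)
    return attn @ v_p                                        # (seq_len, dim)
```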
This list is automatically generated from the titles and abstracts of the papers in this site.