HT-Transformer: Event Sequences Classification by Accumulating Prefix Information with History Tokens
- URL: http://arxiv.org/abs/2508.01474v1
- Date: Sat, 02 Aug 2025 19:50:58 GMT
- Title: HT-Transformer: Event Sequences Classification by Accumulating Prefix Information with History Tokens
- Authors: Ivan Karpukhin, Andrey Savchenko
- Abstract summary: We introduce history tokens, a novel concept that facilitates the accumulation of historical information during next-token prediction pretraining. Our approach significantly improves transformer-based models, achieving impressive results in finance, e-commerce, and healthcare tasks.
- Score: 1.534667887016089
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning has achieved remarkable success in modeling sequential data, including event sequences, temporal point processes, and irregular time series. Recently, transformers have largely replaced recurrent networks in these tasks. However, transformers often underperform RNNs in classification tasks where the objective is to predict future targets. The reason behind this performance gap remains largely unexplored. In this paper, we identify a key limitation of transformers: the absence of a single state vector that provides a compact and effective representation of the entire sequence. Additionally, we show that contrastive pretraining of embedding vectors fails to capture local context, which is crucial for accurate prediction. To address these challenges, we introduce history tokens, a novel concept that facilitates the accumulation of historical information during next-token prediction pretraining. Our approach significantly improves transformer-based models, achieving impressive results in finance, e-commerce, and healthcare tasks. The code is publicly available on GitHub.
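The abstract does not spell out the mechanism in detail, but a minimal sketch of the general idea might look as follows: learnable history tokens are interleaved into the event stream so that, under a causal mask, each one accumulates information about its prefix, and the last one serves as a compact state vector for classification. This is an illustrative sketch only; the interleaving period, module names, and the classification head are assumptions, not the authors' implementation (which is available on GitHub).

```python
# Minimal sketch (not the authors' code): interleave learnable "history"
# tokens into an event sequence so a causal transformer accumulates prefix
# information into them; the last history token is used as the sequence state.
import torch
import torch.nn as nn


class HistoryTokenEncoder(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_layers=2, every=16, n_classes=2):
        super().__init__()
        self.every = every  # assumed hyperparameter: one history token per `every` events
        self.history_embedding = nn.Parameter(torch.randn(d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, event_embeddings):  # (batch, seq_len, d_model)
        b, t, d = event_embeddings.shape
        chunks = []
        for start in range(0, t, self.every):
            chunks.append(event_embeddings[:, start:start + self.every])
            # a history token follows each chunk; under the causal mask it
            # can only attend to the events (and history tokens) before it
            chunks.append(self.history_embedding.expand(b, 1, d))
        x = torch.cat(chunks, dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.encoder(x, mask=mask)
        # the final history token summarizes the whole prefix -> compact state vector
        return self.classifier(h[:, -1])


# usage: logits = HistoryTokenEncoder()(torch.randn(8, 100, 128))  # -> (8, 2)
```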
Related papers
- Curse of Attention: A Kernel-Based Perspective for Why Transformers Fail to Generalize on Time Series Forecasting and Beyond [17.002793355495136]
We propose the first theoretical explanation of the inefficiency of transformers on TSF tasks. We attribute the mechanism behind it to Asymmetric Learning in training attention networks.
arXiv Detail & Related papers (2024-12-08T20:29:06Z)
- Looking Beyond The Top-1: Transformers Determine Top Tokens In Order [13.032106683136394]
We analyze the computation performed by Transformers in the layers after the top-1 prediction has become fixed.
We find that these saturation events happen in order of the corresponding tokens' ranking.
We propose an underlying mechanism of task transition for this sequential saturation.
arXiv Detail & Related papers (2024-10-26T16:00:38Z)
- PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
Self-attention mechanism in Transformer architecture requires positional embeddings to encode temporal order in time series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
arXiv Detail & Related papers (2024-08-20T01:56:07Z)
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions (a generic sketch of this inversion appears after this list).
The iTransformer model achieves state-of-the-art performance on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z)
- GBT: Two-stage transformer framework for non-stationary time series forecasting [3.830797055092574]
We propose GBT, a novel two-stage Transformer framework with Good Beginning.
It decouples the prediction process of TSFT into two stages: an Auto-Regression stage and a Self-Regression stage.
Experiments on seven benchmark datasets demonstrate that GBT outperforms SOTA TSFTs with only canonical attention and convolution.
arXiv Detail & Related papers (2023-07-17T07:55:21Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches, the first time such a result has been demonstrated.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting [50.23240107430597]
We design a special Transformer, i.e., Channel Aligned Robust Blend Transformer (CARD for short), that addresses key shortcomings of channel-independent (CI) Transformers in time series forecasting.
First, CARD introduces a channel-aligned attention structure that allows it to capture temporal correlations among signals.
Second, in order to efficiently utilize the multi-scale knowledge, we design a token blend module to generate tokens with different resolutions.
Third, we introduce a robust loss function for time series forecasting to alleviate the potential overfitting issue.
arXiv Detail & Related papers (2023-05-20T05:16:31Z)
- Ti-MAE: Self-Supervised Masked Time Series Autoencoders [16.98069693152999]
We propose a novel framework named Ti-MAE, in which the input time series are assumed to follow an integrated distribution.
Ti-MAE randomly masks out embedded time series data and learns an autoencoder to reconstruct them at the point level (a simplified sketch appears after this list).
Experiments on several public real-world datasets demonstrate that our framework of masked autoencoding could learn strong representations directly from the raw data.
arXiv Detail & Related papers (2023-01-21T03:20:23Z)
- Robust representations of oil wells' intervals via sparse attention mechanism [2.604557228169423]
We introduce the class of efficient Transformers named Regularized Transformers (Reguformers).
The focus in our experiments is on oil & gas data, namely, well logs.
To evaluate our models for such problems, we work with an industry-scale open dataset consisting of well logs of more than 20 wells.
arXiv Detail & Related papers (2022-12-29T09:56:33Z)
- Augmenting Sequential Recommendation with Pseudo-Prior Items via Reversely Pre-training Transformer [61.818320703583126]
Sequential Recommendation characterizes the evolving patterns by modeling item sequences chronologically.
Recent developments in transformers have inspired the community to design effective sequence encoders.
We introduce a new framework for Augmenting Sequential Recommendation with Pseudo-prior items (ASReP).
arXiv Detail & Related papers (2021-05-02T18:06:23Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
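The iTransformer entry above describes applying attention and the feed-forward network over inverted dimensions, i.e., embedding each variate's whole series as a single token so that attention mixes variates rather than time steps. A generic sketch of that inversion (illustrative names and sizes, not the authors' code):

```python
# Generic sketch of the "inverted" attention idea from the iTransformer entry:
# each variate's full time series becomes one token, so attention operates
# across variates instead of across time steps. Illustrative only.
import torch
import torch.nn as nn


class InvertedEncoder(nn.Module):
    def __init__(self, seq_len, d_model=128, n_heads=4, n_layers=2, pred_len=24):
        super().__init__()
        self.embed = nn.Linear(seq_len, d_model)      # whole series of one variate -> one token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.project = nn.Linear(d_model, pred_len)   # token -> forecast for that variate

    def forward(self, x):                             # x: (batch, seq_len, n_variates)
        tokens = self.embed(x.transpose(1, 2))        # (batch, n_variates, d_model)
        tokens = self.encoder(tokens)                 # attention mixes variates
        return self.project(tokens).transpose(1, 2)   # (batch, pred_len, n_variates)
```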
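The Ti-MAE entry above describes randomly masking embedded time series and reconstructing them at the point level. A simplified masked-reconstruction sketch (single encoder with mask tokens rather than an encoder/decoder split; the names and masking ratio are assumptions, not the authors' code):

```python
# Simplified masked-reconstruction sketch for the Ti-MAE entry: randomly mask
# embedded time steps and train a transformer to reconstruct the raw values
# at the masked positions only. Illustrative assumptions throughout.
import torch
import torch.nn as nn


class MaskedTimeSeriesAE(nn.Module):
    def __init__(self, n_variates, d_model=128, n_heads=4, n_layers=2, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(n_variates, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.decode = nn.Linear(d_model, n_variates)  # point-level reconstruction head

    def forward(self, x):                             # x: (batch, seq_len, n_variates)
        tokens = self.embed(x)
        mask = torch.rand(x.shape[:2], device=x.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        recon = self.decode(self.encoder(tokens))
        # reconstruction loss computed on masked positions only
        return ((recon - x) ** 2)[mask].mean()
```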
This list is automatically generated from the titles and abstracts of the papers on this site.