Time Series Representations for Classification Lie Hidden in Pretrained Vision Transformers
- URL: http://arxiv.org/abs/2506.08641v2
- Date: Wed, 02 Jul 2025 10:32:10 GMT
- Title: Time Series Representations for Classification Lie Hidden in Pretrained Vision Transformers
- Authors: Simon Roschmann, Quentin Bouniot, Vasilii Feofanov, Ievgen Redko, Zeynep Akata
- Abstract summary: We propose Time Vision Transformer (TiViT), a framework that converts time series into images. We show that TiViT achieves state-of-the-art performance on standard time series classification benchmarks. Our findings reveal a new direction for reusing vision representations in a non-visual domain.
- Score: 49.07665715422702
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Time series classification is a fundamental task in healthcare and industry, yet the development of time series foundation models (TSFMs) remains limited by the scarcity of publicly available time series datasets. In this work, we propose Time Vision Transformer (TiViT), a framework that converts time series into images to leverage the representational power of frozen Vision Transformers (ViTs) pretrained on large-scale image datasets. First, we theoretically motivate our approach by analyzing the 2D patching of ViTs for time series, showing that it can increase the number of label-relevant tokens and reduce the sample complexity. Second, we empirically demonstrate that TiViT achieves state-of-the-art performance on standard time series classification benchmarks by utilizing the hidden representations of large OpenCLIP models. We explore the structure of TiViT representations and find that intermediate layers with high intrinsic dimension are the most effective for time series classification. Finally, we assess the alignment between TiViT and TSFM representation spaces and identify a strong complementarity, with further performance gains achieved by combining their features. Our findings reveal a new direction for reusing vision representations in a non-visual domain. Code is available at https://github.com/ExplainableML/TiViT.
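As a rough sketch of this pipeline (a minimal reconstruction from the abstract, not the released implementation), one can fold a series into a 2D image, run it through a frozen pretrained ViT, capture an intermediate layer with a forward hook, and fit a linear probe on the pooled tokens. The grid-fold imaging, the torchvision backbone, and the choice of layer 6 below are illustrative assumptions; the paper uses its own 2D patching analysis and large OpenCLIP models.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Frozen pretrained ViT; TiViT itself uses OpenCLIP backbones.
vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1).eval()
for p in vit.parameters():
    p.requires_grad_(False)

# Capture the token sequence of one intermediate block (layer 6 is an
# arbitrary choice here; the paper probes all layers).
features = {}
vit.encoder.layers[6].register_forward_hook(
    lambda module, inp, out: features.update(tokens=out)
)

def series_to_image(x: torch.Tensor, side: int = 224) -> torch.Tensor:
    """Fold a (batch, length) series into a square grayscale grid,
    then resize and replicate to the 3 channels a ViT expects."""
    b, n = x.shape
    rows = int(n ** 0.5)
    img = x[:, : rows * rows].reshape(b, 1, rows, rows)
    lo = img.amin(dim=(2, 3), keepdim=True)
    hi = img.amax(dim=(2, 3), keepdim=True)
    img = (img - lo) / (hi - lo + 1e-8)  # normalize to [0, 1]
    return F.interpolate(img, size=(side, side), mode="bilinear").repeat(1, 3, 1, 1)

x = torch.randn(8, 1024)              # toy batch of univariate series
with torch.no_grad():
    vit(series_to_image(x))           # forward pass fills the hook
emb = features["tokens"].mean(dim=1)  # mean-pool tokens: (8, 768)
classifier = torch.nn.Linear(768, 5)  # linear probe, 5 hypothetical classes
logits = classifier(emb)
```

Per the abstract, which layer to probe is an empirical question: the authors find that intermediate layers with high intrinsic dimension work best, and that combining TiViT features with TSFM features yields further gains.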
Related papers
- TiVy: Time Series Visual Summary for Scalable Visualization [32.33793043326047]
We propose TiVy, a new algorithm that summarizes time series using sequential patterns.
We also present an interactive time series visualization that renders large-scale time series in real time.
arXiv Detail & Related papers (2025-07-25T05:50:01Z)
- Your ViT is Secretly an Image Segmentation Model [50.71238842539735]
Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks.
We show that inductive biases introduced by task-specific components can instead be learned by the ViT itself.
We introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation.
arXiv Detail & Related papers (2025-03-24T19:56:02Z)
- Time Series as Images: Vision Transformer for Irregularly Sampled Time Series [32.99466250557855]
This paper introduces a novel perspective by converting irregularly sampled time series into line graph images.
We then utilize powerful pre-trained vision transformers for time series classification in the same way as image classification.
Remarkably, despite its simplicity, our approach outperforms state-of-the-art specialized algorithms on several popular healthcare and human activity datasets.
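A minimal sketch of that line-graph encoding (reconstructed from the summary above, not the paper's code): rasterize each irregularly sampled series as a line plot and hand the resulting image to any pretrained vision backbone. Figure size and resolution are arbitrary choices here.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

def series_to_line_graph(t: np.ndarray, x: np.ndarray, size: int = 224) -> np.ndarray:
    """Rasterize an (irregularly sampled) series as an RGB line-graph image."""
    fig, ax = plt.subplots(figsize=(size / 100, size / 100), dpi=100)
    ax.plot(t, x, linewidth=1)
    ax.axis("off")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight", pad_inches=0)
    plt.close(fig)
    buf.seek(0)
    img = Image.open(buf).convert("RGB").resize((size, size))
    return np.asarray(img)  # (224, 224, 3), ready for a ViT preprocessing pipeline

t = np.sort(np.random.uniform(0, 10, size=80))  # irregular timestamps
x = np.sin(t) + 0.1 * np.random.randn(80)
img = series_to_line_graph(t, x)
```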
arXiv Detail & Related papers (2023-03-01T22:42:44Z)
- FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving classification capacity on the multivariate time series classification task.
It exhibits three merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strengths of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z)
- ViTs for SITS: Vision Transformers for Satellite Image Time Series [52.012084080257544]
We introduce a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT).
TSViT splits a SITS record into non-overlapping patches in space and time which are tokenized and subsequently processed by a factorized temporo-spatial encoder.
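A small sketch of what such space-time tokenization can look like (illustrative shapes and patch sizes, not the paper's configuration): split each timestep's image into non-overlapping spatial patches, yielding one token per (timestep, patch) pair for a factorized encoder to process.

```python
import torch

def tokenize_sits(x: torch.Tensor, ph: int = 4, pw: int = 4) -> torch.Tensor:
    """Split a satellite image time series (T, C, H, W) into
    non-overlapping ph x pw spatial patches per timestep.
    Returns tokens of shape (T, num_patches, C * ph * pw)."""
    T, C, H, W = x.shape
    x = x.unfold(2, ph, ph).unfold(3, pw, pw)  # (T, C, H/ph, W/pw, ph, pw)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(T, -1, C * ph * pw)

tokens = tokenize_sits(torch.randn(12, 10, 32, 32))  # 12 dates, 10 bands
print(tokens.shape)  # torch.Size([12, 64, 160])
# A factorized temporo-spatial encoder then attends over one axis at a
# time (e.g. across the 12 dates, then across the 64 patches) instead
# of over all 768 tokens jointly.
```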
arXiv Detail & Related papers (2023-01-12T11:33:07Z)
- Expressing Multivariate Time Series as Graphs with Time Series Attention Transformer [14.172091921813065]
We propose the Time Series Attention Transformer (TSAT) for multivariate time series representation learning.
Using TSAT, we represent both temporal information and inter-dependencies of time series in terms of edge-enhanced dynamic graphs.
We show that TSAT clearly outperforms six state-of-the-art baseline methods across various forecasting horizons.
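One simple way to realize the series-as-graph view (an illustrative construction, not TSAT's edge-enhanced dynamic graphs): treat each variable as a node and weight edges by pairwise correlation, thresholded at an arbitrary cutoff.

```python
import numpy as np

def correlation_graph(x: np.ndarray, tau: float = 0.3) -> np.ndarray:
    """x: (length, num_series). Returns a weighted adjacency matrix
    that keeps only edges with |correlation| above tau."""
    adj = np.corrcoef(x.T)        # (num_series, num_series)
    np.fill_diagonal(adj, 0.0)    # no self-loops
    return np.where(np.abs(adj) > tau, adj, 0.0)

x = np.random.randn(500, 6).cumsum(axis=0)  # toy multivariate series
adj = correlation_graph(x)                  # feed to any graph encoder
```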
arXiv Detail & Related papers (2022-08-19T12:25:56Z)
- HyperTime: Implicit Neural Representation for Time Series [131.57172578210256]
Implicit neural representations (INRs) have recently emerged as a powerful tool that provides an accurate and resolution-independent encoding of data.
In this paper, we analyze the representation of time series using INRs, comparing different activation functions in terms of reconstruction accuracy and training convergence speed.
We propose a hypernetwork architecture that leverages INRs to learn a compressed latent representation of an entire time series dataset.
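A toy version of such an INR for a single series (a SIREN-style sine MLP; the paper's hypernetwork over an entire dataset is not reproduced here): fit a small network mapping timestamps to values, after which the series can be queried at any resolution.

```python
import torch
import torch.nn as nn

class SineINR(nn.Module):
    """Map a timestamp t in [0, 1] to a signal value x(t)."""
    def __init__(self, hidden: int = 64, w0: float = 30.0):
        super().__init__()
        self.w0 = w0
        self.l1 = nn.Linear(1, hidden)
        self.l2 = nn.Linear(hidden, hidden)
        self.l3 = nn.Linear(hidden, 1)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        h = torch.sin(self.w0 * self.l1(t))  # high-frequency first layer
        h = torch.sin(self.l2(h))
        return self.l3(h)

t = torch.linspace(0, 1, 200).unsqueeze(1)       # normalized timestamps
x = torch.sin(12 * t) + 0.3 * torch.sin(25 * t)  # toy signal
inr = SineINR()
opt = torch.optim.Adam(inr.parameters(), lr=1e-3)
for _ in range(2000):                            # fit coordinates -> values
    opt.zero_grad()
    loss = ((inr(t) - x) ** 2).mean()
    loss.backward()
    opt.step()
# The fitted INR is resolution-independent: query it at arbitrary
# timestamps, e.g. inr(torch.rand(50, 1)).
```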
arXiv Detail & Related papers (2022-08-11T14:05:51Z)
- Attention Augmented Convolutional Transformer for Tabular Time-series [0.9137554315375922]
Time-series classification is one of the most frequently performed tasks in industrial data science.
We propose a novel scalable architecture for learning representations from time-series data.
Our proposed model is end-to-end and can handle both categorical and continuous-valued inputs.
arXiv Detail & Related papers (2021-10-05T05:20:46Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
This makes it possible to scale up depth, width, resolution, and patch size without introducing extra computational complexity.
Our HVT outperforms competitive baselines on the ImageNet and CIFAR-100 datasets.
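The token-pooling mechanism is easy to sketch (an illustration of the idea, not the HVT implementation): apply a strided pool along the token dimension between transformer stages, roughly halving the sequence length, and hence the attention cost, at each stage.

```python
import torch
import torch.nn as nn

class PooledStage(nn.Module):
    """A transformer stage followed by token-sequence downsampling."""
    def __init__(self, dim: int = 192, heads: int = 3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        tokens = self.blocks(tokens)  # (B, N, D)
        # Pool over the token axis, shrinking N before the next stage.
        return self.pool(tokens.transpose(1, 2)).transpose(1, 2)

tokens = torch.randn(2, 196, 192)   # e.g. 14 x 14 patch tokens
print(PooledStage()(tokens).shape)  # torch.Size([2, 98, 192])
```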
arXiv Detail & Related papers (2021-03-19T03:55:58Z)