Distilling Time Series Foundation Models for Efficient Forecasting
- URL: http://arxiv.org/abs/2601.12785v1
- Date: Mon, 19 Jan 2026 07:32:00 GMT
- Title: Distilling Time Series Foundation Models for Efficient Forecasting
- Authors: Yuqi Li, Kuiye Ding, Chuanguang Yang, Szu-Yu Chen, Yingli Tian,
- Abstract summary: We present DistilTS, the first distillation framework specifically designed for Time Series foundation models (TSFMs). DistilTS addresses two key challenges: (1) task difficulty discrepancy, specific to forecasting, where uniform weighting lets optimization be dominated by easier short-term horizons while long-term horizons receive weaker supervision; and (2) architecture discrepancy, a general challenge in distillation, for which we design an alignment mechanism tailored to time series forecasting. Experiments on multiple benchmarks demonstrate that DistilTS achieves forecasting performance comparable to full-sized TSFMs, while reducing parameter count to as little as 1/150 of the original and accelerating inference by up to 6000x.
- Score: 28.730703685779186
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Time Series foundation models (TSFMs) deliver strong forecasting performance through large-scale pretraining, but their large parameter counts make deployment costly. While knowledge distillation offers a natural and effective approach to model compression, techniques developed for general machine learning tasks are not directly applicable to time series forecasting due to its unique characteristics. To address this, we present DistilTS, the first distillation framework specifically designed for TSFMs. DistilTS addresses two key challenges: (1) task difficulty discrepancy, specific to forecasting, where uniform weighting lets optimization be dominated by easier short-term horizons while long-term horizons receive weaker supervision; and (2) architecture discrepancy, a general challenge in distillation, for which we design an alignment mechanism tailored to time series forecasting. To overcome these issues, DistilTS introduces horizon-weighted objectives that balance learning across horizons, and a temporal alignment strategy that reduces architectural mismatch, enabling compact student models. Experiments on multiple benchmarks demonstrate that DistilTS achieves forecasting performance comparable to full-sized TSFMs, while reducing parameter count to as little as 1/150 of the original and accelerating inference by up to 6000x. Code is available at: https://github.com/itsnotacie/DistilTS-ICASSP2026.
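The abstract does not spell out the exact form of the horizon-weighted objective, so the sketch below only illustrates the general idea of up-weighting harder long-term horizons during distillation. The linear weight schedule, the 50/50 mix of ground-truth and teacher terms, and the tensor shapes are assumptions for illustration, not the authors' implementation.

```python
import torch

def horizon_weighted_mse(student_pred, teacher_pred, target, alpha=0.5):
    """Hypothetical horizon-weighted distillation loss.

    All inputs have shape (batch, horizon). Later forecast steps receive
    larger weights so long-term supervision is not drowned out by the
    easier short-term steps.
    """
    horizon = target.shape[-1]
    # Linearly increasing weights over the horizon, normalized to sum to 1 (assumed schedule).
    weights = torch.linspace(1.0, 2.0, horizon, device=target.device)
    weights = weights / weights.sum()

    gt_err = (student_pred - target) ** 2        # supervision from ground truth
    kd_err = (student_pred - teacher_pred) ** 2  # supervision from the teacher
    per_step = alpha * gt_err + (1.0 - alpha) * kd_err
    return (per_step * weights).sum(dim=-1).mean()
```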
Related papers
- FaST: Efficient and Effective Long-Horizon Forecasting for Large-Scale Spatial-Temporal Graphs via Mixture-of-Experts [49.9321870703948]
Existing models predominantly focus on short-horizon predictions and suffer from high computational costs and memory consumption. We present FaST, an effective and efficient framework based on Mixture-of-Experts (MoEs) for long-horizon and large-scale STG forecasting. FaST is underpinned by two key innovations. First, an adaptive graph agent attention mechanism is proposed to alleviate the computational burden. Second, we propose a new parallel MoE module that replaces traditional feed-forward networks with Gated Linear Units (GLUs).
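The summary only names Gated Linear Units as the feed-forward replacement, so here is a minimal, generic GLU feed-forward block for reference; the layer sizes and the SiLU gate are illustrative assumptions and are not taken from the FaST paper.

```python
import torch
import torch.nn as nn

class GLUFeedForward(nn.Module):
    """Generic gated feed-forward block, a drop-in stand-in for a plain FFN.

    Output = W_out(value(x) * silu(gate(x))), where value and gate are two
    separate linear projections of the input.
    """
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.value = nn.Linear(d_model, d_hidden)
        self.gate = nn.Linear(d_model, d_hidden)
        self.out = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(self.value(x) * torch.nn.functional.silu(self.gate(x)))
```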
arXiv Detail & Related papers (2026-01-08T18:00:58Z) - TS2Vec-Ensemble: An Enhanced Self-Supervised Framework for Time Series Forecasting [1.2461503242570642]
This paper introduces TS2Vec-Ensemble, a novel hybrid framework for time series forecasting. Our approach enhances the powerful, implicitly learned dynamics from a pretrained TS2Vec encoder by fusing them with explicit, engineered time features that encode periodic cycles. The results demonstrate that TS2Vec-Ensemble consistently and significantly outperforms the standard TS2Vec baseline and other state-of-the-art models.
arXiv Detail & Related papers (2025-11-27T12:19:18Z) - SynCast: Synergizing Contradictions in Precipitation Nowcasting via Diffusion Sequential Preference Optimization [62.958457694151384]
We introduce preference optimization into precipitation nowcasting for the first time, motivated by the success of reinforcement learning from human feedback in large language models. In the first stage, the framework focuses on reducing FAR, training the model to effectively suppress false alarms.
arXiv Detail & Related papers (2025-10-22T16:11:22Z) - SEMPO: Lightweight Foundation Models for Time Series Forecasting [45.456949943052116]
SEMPO is a lightweight foundation model that requires pretraining on only relatively small-scale data, yet exhibits strong general time series forecasting capability. SEMPO comprises two key modules: 1) an energy-aware SpEctral decomposition module that substantially improves the utilization of pre-training data. Experiments on two large-scale benchmarks covering 16 datasets demonstrate the superior performance of SEMPO in both zero-shot and few-shot forecasting scenarios.
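The summary names the energy-aware spectral decomposition module but does not describe it; the sketch below shows one generic reading, namely keeping only the highest-energy frequency components of each series. The FFT-based top-k selection is an assumption for illustration and is not claimed to be SEMPO's actual module.

```python
import torch

def top_energy_components(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k highest-energy frequency components of each series.

    x has shape (batch, length); the function returns the reconstruction
    from the retained components only. A generic energy-based spectral
    filter, used here purely as an illustration.
    """
    spec = torch.fft.rfft(x, dim=-1)              # complex spectrum
    energy = spec.abs() ** 2                      # per-frequency energy
    idx = energy.topk(k, dim=-1).indices          # top-k bins per series
    mask = torch.zeros_like(energy).scatter_(-1, idx, 1.0)
    return torch.fft.irfft(spec * mask, n=x.shape[-1], dim=-1)
```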
arXiv Detail & Related papers (2025-10-22T15:58:44Z) - SVTime: Small Time Series Forecasting Models Informed by "Physics" of Large Vision Model Forecasters [86.38433605933515]
Time series AI is crucial for analyzing dynamic web content. Given their energy-intensive training, inference, and hardware demands, using large models as a one-fits-all solution raises serious concerns about carbon footprint and sustainability. This paper introduces SVTime, a novel Small model inspired by large Vision model (LVM) forecasters for long-term Time series forecasting (LTSF).
arXiv Detail & Related papers (2025-10-10T18:42:23Z) - The Curious Case of In-Training Compression of State Space Models [49.819321766705514]
State Space Models (SSMs) tackle long sequence modeling tasks efficiently, offering both parallelizable training and fast inference. A key design challenge is striking the right balance between maximizing expressivity and limiting this computational burden. Our approach, CompreSSM, applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models.
arXiv Detail & Related papers (2025-10-03T09:02:33Z) - GateTS: Versatile and Efficient Forecasting via Attention-Inspired routed Mixture-of-Experts [0.0]
We propose a model architecture that simplifies the training process for univariate time series forecasting. Our approach combines sparse MoE computation with a novel attention-inspired gating mechanism that replaces the traditional one-layer softmax router.
arXiv Detail & Related papers (2025-08-24T20:39:50Z) - Time-o1: Time-Series Forecasting Needs Transformed Label Alignment [50.54348432664401]
Time-o1 is a transformation-augmented learning objective tailored for time-series forecasting. The central idea is to transform the label sequence into decorrelated components with discriminated significance. Time-o1 achieves state-of-the-art performance and is compatible with various forecasting models.
arXiv Detail & Related papers (2025-05-23T13:00:35Z) - Does Scaling Law Apply in Time Series Forecasting? [2.127584662240465]
We propose Alinear, an ultra-lightweight forecasting model that achieves competitive performance using only k-level (i.e., thousands of) parameters. Experiments on seven benchmark datasets demonstrate that Alinear consistently outperforms large-scale models. This work challenges the prevailing belief that larger models are inherently better and suggests a paradigm shift toward more efficient time series modeling.
arXiv Detail & Related papers (2025-05-15T11:04:39Z) - SparseTSF: Modeling Long-term Time Series Forecasting with 1k Parameters [16.966008476215258]
This paper introduces SparseTSF, a novel, extremely lightweight model for Long-term Time Series Forecasting (LTSF).
At the heart of SparseTSF lies the Cross-Period Sparse Forecasting technique, which simplifies the forecasting task by decoupling the periodicity and trend in time series data.
SparseTSF showcases remarkable generalization capabilities, making it well-suited for scenarios with limited computational resources, small samples, or low-quality data.
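One plausible reading of the Cross-Period Sparse Forecasting idea is to downsample the input into per-phase subsequences along a known period and forecast each subsequence with a single shared linear map; the sketch below follows that reading, and the reshape layout, parameter sharing, and divisibility requirements are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class CrossPeriodLinear(nn.Module):
    """Sketch of cross-period forecasting with a shared linear map.

    A length-L input with known period w is split into w phase
    subsequences of length L // w; one linear layer, shared across all
    phases, maps each subsequence to its future values, and the outputs
    are interleaved back into a length-H forecast.
    """
    def __init__(self, lookback: int, horizon: int, period: int):
        super().__init__()
        assert lookback % period == 0 and horizon % period == 0
        self.period = period
        self.linear = nn.Linear(lookback // period, horizon // period)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, L = x.shape                                   # x: (batch, lookback)
        sub = x.reshape(b, L // self.period, self.period).transpose(1, 2)
        out = self.linear(sub)                           # shared map per phase
        return out.transpose(1, 2).reshape(b, -1)        # (batch, horizon)

# Example usage (hypothetical sizes): hourly data with a daily cycle.
# model = CrossPeriodLinear(lookback=720, horizon=96, period=24)
```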
arXiv Detail & Related papers (2024-05-02T02:15:23Z) - Timer: Generative Pre-trained Transformers Are Large Time Series Models [83.03091523806668]
This paper aims at the early development of large time series models (LTSM).
During pre-training, we curate large-scale datasets with up to 1 billion time points.
To meet diverse application needs, we convert forecasting, imputation, and anomaly detection of time series into a unified generative task.
arXiv Detail & Related papers (2024-02-04T06:55:55Z) - Parsimony or Capability? Decomposition Delivers Both in Long-term Time Series Forecasting [46.63798583414426]
Long-term time series forecasting (LTSF) represents a critical frontier in time series analysis.
Our study demonstrates, through both analytical and empirical evidence, that decomposition is key to containing excessive model inflation.
Remarkably, by tailoring decomposition to the intrinsic dynamics of time series data, our proposed model outperforms existing benchmarks.
arXiv Detail & Related papers (2024-01-22T13:15:40Z)