Moirai 2.0: When Less Is More for Time Series Forecasting
- URL: http://arxiv.org/abs/2511.11698v1
- Date: Wed, 12 Nov 2025 12:15:35 GMT
- Title: Moirai 2.0: When Less Is More for Time Series Forecasting
- Authors: Chenghao Liu, Taha Aksu, Juncheng Liu, Xu Liu, Hanshu Yan, Quang Pham, Doyen Sahoo, Caiming Xiong, Silvio Savarese, Junnan Li
- Abstract summary: Moirai 2.0 is a decoder-only foundation model trained on a new corpus of 36M series. It ranks among the top pretrained models while achieving a strong trade-off between accuracy, speed, and model size. In terms of efficiency and model size, Moirai 2.0 is twice as fast and thirty times smaller than its prior best version, Moirai 1.0-Large.
- Score: 91.36760228926214
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Moirai 2.0, a decoder-only time-series foundation model trained on a new corpus of 36M series. The model adopts quantile forecasting and multi-token prediction, improving both probabilistic accuracy and inference efficiency. On the Gift-Eval benchmark, it ranks among the top pretrained models while achieving a strong trade-off between accuracy, speed, and model size. Compared to Moirai 1.0, Moirai 2.0 replaces masked-encoder training, multi-patch inputs, and mixture-distribution outputs with a simpler decoder-only architecture, single-patch inputs, and a quantile loss. Ablation studies isolate these changes, showing that the decoder-only backbone and recursive multi-quantile decoding contribute most to the gains. Additional experiments show that Moirai 2.0 outperforms larger models from the same family and exhibits robust domain-level results. In terms of efficiency and model size, Moirai 2.0 is twice as fast and thirty times smaller than its prior best version, Moirai 1.0-Large, while also performing better. Model performance plateaus with increasing parameter count and declines at longer horizons, motivating future work on data scaling and long-horizon modeling. We release code and evaluation details to support further research.
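The two mechanisms named in the abstract, quantile forecasting trained with a quantile loss and recursive multi-quantile decoding over the forecast horizon, can be illustrated with a short, generic sketch. The snippet below is not the Moirai 2.0 implementation: the quantile levels, the patch length, and the `naive_model` stand-in are assumptions made purely for illustration.

```python
# A minimal NumPy sketch of the two ideas named in the abstract: the quantile
# ("pinball") loss behind quantile forecasting, and a recursive decoding loop
# that rolls a short multi-quantile prediction out to a longer horizon.
# Everything below (quantile levels, the toy model) is illustrative only.
import numpy as np

QUANTILES = (0.1, 0.5, 0.9)  # assumed quantile levels, not Moirai 2.0's actual set


def pinball_loss(y_true, y_pred, quantiles=QUANTILES):
    """Average quantile (pinball) loss.

    y_true: shape (T,)   observed values
    y_pred: shape (T, Q) predicted quantiles, one column per level
    """
    y_true = y_true[:, None]                       # (T, 1)
    q = np.asarray(quantiles)[None, :]             # (1, Q)
    diff = y_true - y_pred                         # positive => under-prediction
    return np.maximum(q * diff, (q - 1.0) * diff).mean()


def recursive_quantile_forecast(model, context, horizon):
    """Roll a one-patch quantile predictor out recursively.

    `model(context)` is assumed to return an array of shape (patch, Q):
    a patch of future steps, each with Q predicted quantiles. The median
    column of each patch is appended to the context before the next call.
    """
    median_idx = QUANTILES.index(0.5)
    ctx = np.asarray(context, dtype=float)
    outputs = []
    produced = 0
    while produced < horizon:
        patch_quantiles = model(ctx)               # (patch, Q)
        outputs.append(patch_quantiles)
        produced += patch_quantiles.shape[0]
        ctx = np.concatenate([ctx, patch_quantiles[:, median_idx]])
    return np.concatenate(outputs, axis=0)[:horizon]  # (horizon, Q)


if __name__ == "__main__":
    # Toy stand-in "model": repeats the last observed value for every quantile
    # with a fixed spread, just to exercise the decoding loop end to end.
    def naive_model(ctx, patch=8):
        spread = np.array([-1.0, 0.0, 1.0])        # one offset per quantile level
        return np.tile(ctx[-1] + spread, (patch, 1))

    history = np.sin(np.linspace(0.0, 6.0, 64))
    forecast = recursive_quantile_forecast(naive_model, history, horizon=24)
    truth = np.sin(np.linspace(6.0, 8.0, 24))
    print("pinball loss:", pinball_loss(truth, forecast))
```

In this sketch only the median of each emitted patch is fed back as context for the next step, which is one plausible reading of "recursive multi-quantile decoding"; the released code should be consulted for the model's actual decoding procedure.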
Related papers
- Sequential Diffusion Language Models [110.06562906987052]
Diffusion language models (DLMs) have strong theoretical efficiency but are limited by fixed-length decoding and incompatibility with key-value caches. We introduce Next Sequence Prediction (NSP), which unifies next-token and next-block prediction. We propose Sequential Diffusion Language Model (SDLM), which can retrofit pre-trained autoregressive language models (ALMs) at minimal cost.
arXiv Detail & Related papers (2025-09-28T17:59:15Z) - Seq vs Seq: An Open Suite of Paired Encoders and Decoders [37.62535961965971]
We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes. We show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective.
arXiv Detail & Related papers (2025-07-15T15:31:51Z) - Training and Inference Efficiency of Encoder-Decoder Speech Models [25.031622057759492]
We focus on the efficiency angle and ask whether we are training these speech models efficiently. We show that negligence in mini-batch sampling leads to more than 50% of computation being spent on padding. We find that adjusting the model architecture to transfer model parameters from the decoder to the encoder results in a 3x inference speedup.
arXiv Detail & Related papers (2025-03-07T20:57:43Z) - Scaling Inference-Efficient Language Models [3.271571137474847]
We show that model architecture affects inference latency, where models of the same size can have up to 3.5x difference in latency. We modify the Chinchilla scaling laws to co-optimize the model parameter count, the number of training tokens, and the model architecture. We release the Morph-1B model, which improves inference latency by 1.8x while maintaining accuracy on downstream tasks.
arXiv Detail & Related papers (2025-01-30T03:16:44Z) - A Mamba Foundation Model for Time Series Forecasting [13.593170999506889]
We introduce TSMamba, a linear-complexity foundation model for time series forecasting built on the Mamba architecture.
The model captures temporal dependencies through both forward and backward Mamba encoders, achieving high prediction accuracy.
It also achieves competitive or superior full-shot performance compared to task-specific prediction models.
arXiv Detail & Related papers (2024-11-05T09:34:05Z) - A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets. We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
arXiv Detail & Related papers (2024-10-15T17:59:10Z) - Tandem Transformers for Inference Efficient LLMs [49.75726447408795]
We introduce a novel architecture, Tandem transformers, to address these issues.
This architecture uniquely combines a small autoregressive model and a large model operating in block mode.
On the PaLM2 pretraining dataset, a tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy.
arXiv Detail & Related papers (2024-02-13T18:24:08Z) - Online Speculative Decoding [34.987825705622555]
We introduce online speculative decoding to accelerate the inference of large language models.
The main idea is to continuously update the (multiple) draft model(s) on observed user query data.
We develop a prototype of online speculative decoding based on knowledge distillation and evaluate it using both synthetic and real query data.
arXiv Detail & Related papers (2023-10-11T04:03:42Z) - The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking Inference [70.36083572306839]
This paper proposes a new training and inference paradigm for re-ranking.
We finetune a pretrained encoder-decoder model on the task of document-to-query generation.
We show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference.
arXiv Detail & Related papers (2022-04-25T06:26:29Z)