Small Vocabularies, Big Gains: Pretraining and Tokenization in Time Series Models
- URL: http://arxiv.org/abs/2511.11622v1
- Date: Thu, 06 Nov 2025 20:16:21 GMT
- Title: Small Vocabularies, Big Gains: Pretraining and Tokenization in Time Series Models
- Authors: Alexis Roger, Gwen Legate, Kashif Rasul, Yuriy Nevmyvaka, Irina Rish,
- Abstract summary: We show that tokenizer configuration governs the representational capacity and stability of the model. We demonstrate that pretrained models consistently leverage well-designed tokenizers more effectively. Misaligned tokenization can diminish or even invert the benefits of pretraining.
- Score: 20.41613649587587
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tokenization and transfer learning are two critical components in building state-of-the-art time series foundation models for forecasting. In this work, we systematically study the effect of tokenizer design, specifically scaling and quantization strategies, on model performance, alongside the impact of pretraining versus random initialization. We show that tokenizer configuration primarily governs the representational capacity and stability of the model, while transfer learning influences optimization efficiency and alignment. Using a combination of empirical training experiments and theoretical analyses, we demonstrate that pretrained models consistently leverage well-designed tokenizers more effectively, particularly at smaller vocabulary sizes. Conversely, misaligned tokenization can diminish or even invert the benefits of pretraining. These findings highlight the importance of careful tokenization in time series modeling and suggest that combining small, efficient vocabularies with pretrained weights is especially advantageous in multi-modal forecasting settings, where the overall vocabulary must be shared across modalities. Our results provide concrete guidance for designing tokenizers and leveraging transfer learning in discrete representation learning for continuous signals.
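The tokenizer designs studied in the abstract map continuous series values to a discrete vocabulary via scaling and quantization. As a concrete illustration, the sketch below implements one common recipe (mean scaling followed by uniform binning); it is an assumed, minimal example for intuition, not necessarily the authors' exact configuration, and the function names and defaults are hypothetical.

```python
import numpy as np

def tokenize_series(x, vocab_size=256, clip=3.0):
    """Tokenize a 1-D series via mean scaling + uniform quantization.

    Rescales by the mean absolute value, clips to [-clip, clip],
    then maps each value to an integer bin index (a "token").
    """
    x = np.asarray(x, dtype=float)
    scale = float(np.mean(np.abs(x)))
    if scale == 0.0:
        scale = 1.0  # avoid division by zero on an all-zero series
    scaled = np.clip(x / scale, -clip, clip)
    # vocab_size uniform bins over [-clip, clip]
    edges = np.linspace(-clip, clip, vocab_size + 1)
    tokens = np.clip(np.digitize(scaled, edges) - 1, 0, vocab_size - 1)
    return tokens, scale

def detokenize(tokens, scale, vocab_size=256, clip=3.0):
    """Invert quantization by mapping each token to its bin center."""
    edges = np.linspace(-clip, clip, vocab_size + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[tokens] * scale
```

A smaller `vocab_size` coarsens the bins and raises reconstruction error, which is the capacity trade-off the paper studies: with pretrained weights, even small vocabularies can remain effective.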
Related papers
- LLM-Inspired Pretrain-Then-Finetune for Small-Data, Large-Scale Optimization [7.8639568562295965]
We consider small-data, large-scale decision problems in which a firm must make many operational decisions simultaneously. We propose a pretrain-then-finetune approach built on a purpose-designed Transformer model to address this challenge.
arXiv Detail & Related papers (2026-02-03T16:08:33Z) - The Art of Breaking Words: Rethinking Multilingual Tokenizer Design [21.9940001977516]
Existing tokenizers exhibit high token-to-word ratios, inefficient use of context length, and slower inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. Our tokenizer achieves a more than 40% average improvement in token-to-word ratio over state-of-the-art multilingual Indic models.
arXiv Detail & Related papers (2025-08-03T15:31:10Z) - DeepInsert: Early Layer Bypass for Efficient and Performant Multimodal Understanding [26.39397960987363]
We propose a simple modification to pretrained transformer models. Instead of concatenating multimodal tokens with the language prompt at the start, we insert them directly into the middle layers. Our results indicate that our method reduces computational costs during both training and inference.
arXiv Detail & Related papers (2025-04-27T18:56:26Z) - "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables better interpretability, aligning with the human visual system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - Scaling LLM Pre-training with Vocabulary Curriculum [0.0]
We introduce vocabulary curriculum learning, an approach that improves pretraining efficiency with log-linear scaling gains relative to vocabulary size. Our method alternates between entropy-guided vocabulary expansion and model optimization, enabling models to learn transferable representations across diverse tokenization granularities. Experiments on small-scale GPT models demonstrate improved scaling efficiency, reinforcing the effectiveness of dynamic tokenization.
arXiv Detail & Related papers (2025-02-25T07:18:29Z) - Improving Network Interpretability via Explanation Consistency Evaluation [56.14036428778861]
We propose a framework that produces more explainable activation heatmaps while simultaneously increasing model performance.
Specifically, our framework introduces a new metric, explanation consistency, to adaptively reweight the training samples during model learning.
Our framework then promotes model learning by paying closer attention to those training samples whose explanations differ substantially.
arXiv Detail & Related papers (2024-08-08T17:20:08Z) - Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy, dubbed OrthSR, is further employed to maintain the stability of VLMs' zero-shot generalization.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in few-shot image classification scenarios.
arXiv Detail & Related papers (2024-07-11T10:35:53Z) - DETAIL: Task DEmonsTration Attribution for Interpretable In-context Learning [75.68193159293425]
In-context learning (ICL) allows transformer-based language models to learn a specific task from a few "task demonstrations" without updating their parameters. We propose an influence-function-based attribution technique, DETAIL, that addresses the specific characteristics of ICL. We experimentally demonstrate the wide applicability of DETAIL by showing that attribution scores obtained on white-box models transfer to black-box models, improving model performance.
arXiv Detail & Related papers (2024-05-22T15:52:52Z) - Collaborative decoding of critical tokens for boosting factuality of large language models [57.504894664689]
Finetuned and aligned models show improved instruction following and safer generation.
The common practice of using sampling during generation also increases the chances of hallucination.
We introduce a collaborative decoding framework to harness the high factuality within pretrained models through the concept of critical tokens.
arXiv Detail & Related papers (2024-02-28T01:53:37Z) - An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.