TS-Haystack: A Multi-Scale Retrieval Benchmark for Time Series Language Models
- URL: http://arxiv.org/abs/2602.14200v1
- Date: Sun, 15 Feb 2026 15:50:02 GMT
- Title: TS-Haystack: A Multi-Scale Retrieval Benchmark for Time Series Language Models
- Authors: Nicolas Zumarraga, Thomas Kaar, Ning Wang, Maxwell A. Xu, Max Rosenblattl, Markus Kreft, Kevin O'Sullivan, Paul Schmiedmayer, Patrick Langer, Robert Jakob
- Abstract summary: Time Series Language Models (TSLMs) are emerging as unified models for reasoning over continuous signals in natural language. Existing models are typically trained and evaluated on short sequences, while real-world time-series sensor streams can span millions of datapoints. We introduce TS-Haystack, a long-context temporal retrieval benchmark comprising ten task types across four categories.
- Score: 4.387988928531881
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Time Series Language Models (TSLMs) are emerging as unified models for reasoning over continuous signals in natural language. However, long-context retrieval remains a major limitation: existing models are typically trained and evaluated on short sequences, while real-world time-series sensor streams can span millions of datapoints. This mismatch requires precise temporal localization under strict computational constraints, a regime not captured by current benchmarks. We introduce TS-Haystack, a long-context temporal retrieval benchmark comprising ten task types across four categories: direct retrieval, temporal reasoning, multi-step reasoning, and contextual anomaly. The benchmark uses controlled needle insertion, embedding short activity bouts into longer longitudinal accelerometer recordings, enabling systematic evaluation across context lengths ranging from seconds to 2 hours per sample. We hypothesize that existing TSLM time series encoders overlook temporal granularity as context length increases, creating a task-dependent effect: compression aids classification but impairs retrieval of localized events. Across multiple models and encoding strategies, we observe a consistent divergence between classification and retrieval behavior. Learned latent compression preserves or improves classification accuracy at compression ratios up to 176$\times$, but retrieval performance degrades with context length, losing temporally localized information. These results highlight the importance of architectural designs that decouple sequence length from computational complexity while preserving temporal fidelity.
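The controlled needle-insertion construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code; the function name, signal parameters, and sampling rate below are illustrative assumptions. The idea is simply to overwrite a random window of a long background recording with a short activity bout and record the ground-truth interval, so that retrieval accuracy can later be scored against it.

```python
import numpy as np

def insert_needle(haystack, needle, rng=None):
    """Embed a short activity bout (needle) into a longer background
    recording (haystack) at a random offset.

    Returns the combined signal and the ground-truth [start, end)
    interval of the inserted bout."""
    if rng is None:
        rng = np.random.default_rng()
    start = int(rng.integers(0, len(haystack) - len(needle)))
    combined = haystack.copy()
    combined[start:start + len(needle)] = needle  # overwrite window
    return combined, (start, start + len(needle))

# Hypothetical example: a 2-hour recording at 100 Hz (720,000 samples)
# with a 10-second sinusoidal "activity bout" as the needle.
fs = 100
haystack = np.random.default_rng(0).normal(0.0, 0.05, size=2 * 3600 * fs)
needle = np.sin(2 * np.pi * 2.0 * np.arange(10 * fs) / fs)
signal, (lo, hi) = insert_needle(haystack, needle, np.random.default_rng(1))
```

Varying the haystack length while holding the needle fixed is what enables the paper's systematic sweep over context lengths from seconds to hours.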
Related papers
- LiveSearchBench: An Automatically Constructed Benchmark for Retrieval and Reasoning over Dynamic Knowledge [31.40589987269264]
We present LiveSearchBench, an automated pipeline for constructing retrieval-dependent benchmarks from recent knowledge updates. Our method computes deltas between successive Wikidata snapshots, filters candidate triples for quality, and synthesizes natural-language questions at three levels of reasoning difficulty. Experiments show a pronounced performance drop when models confront facts that post-date pretraining, with the gap most salient on multi-hop queries.
arXiv Detail & Related papers (2025-11-03T10:00:49Z) - CMT-Bench: Cricket Multi-Table Generation Benchmark for Probing Robustness in Large Language Models [11.167804698594866]
We present CMT-Bench, a diagnostic benchmark built from live cricket commentary. We find large drops without extractive summaries, monotonic degradation with input length, and a consistent accuracy drop under entity-form changes.
arXiv Detail & Related papers (2025-10-20T23:51:28Z) - TRACE: Grounding Time Series in Context for Multimodal Embedding and Retrieval [32.06255656982559]
TRACE is a generic multimodal retriever that grounds time-series embeddings in aligned textual context. It supports flexible cross-modal retrieval modes, including Text-to-Timeseries and Timeseries-to-Text. TRACE also serves as a powerful standalone encoder, with lightweight task-specific tuning that refines context-aware representations.
arXiv Detail & Related papers (2025-06-10T17:59:56Z) - Timer-XL: Long-Context Transformers for Unified Time Series Forecasting [67.83502953961505]
We present Timer-XL, a causal Transformer for unified time series forecasting. Based on large-scale pre-training, Timer-XL achieves state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2024-10-07T07:27:39Z) - Analyzing Temporal Complex Events with Large Language Models? A Benchmark towards Temporal, Long Context Understanding [57.62275091656578]
We refer to the complex events composed of many news articles over an extended period as Temporal Complex Event (TCE)
This paper proposes a novel approach using Large Language Models (LLMs) to systematically extract and analyze the event chain within TCE.
arXiv Detail & Related papers (2024-06-04T16:42:17Z) - TimeSiam: A Pre-Training Framework for Siamese Time-Series Modeling [67.02157180089573]
Time series pre-training has recently garnered wide attention for its potential to reduce labeling expenses and benefit various downstream tasks.
This paper proposes TimeSiam as a simple but effective self-supervised pre-training framework for Time series based on Siamese networks.
arXiv Detail & Related papers (2024-02-04T13:10:51Z) - Retrieving Continuous Time Event Sequences using Neural Temporal Point Processes with Learnable Hashing [24.963828650935913]
We propose NeuroSeqRet, a first-of-its-kind framework designed specifically for end-to-end CTES retrieval.
We develop four variants of the relevance model for different kinds of applications based on the trade-off between accuracy and efficiency.
Our experiments show the significant accuracy boost of NeuroSeqRet as well as the efficacy of our hashing mechanism.
arXiv Detail & Related papers (2023-07-13T18:54:50Z) - FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving the classification capacity for the multivariate time series classification task.
It exhibits three aspects of merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strength of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z) - Once-for-All Sequence Compression for Self-Supervised Speech Models [62.60723685118747]
We introduce a once-for-all sequence compression framework for self-supervised speech models.
The framework is evaluated on various tasks, showing marginal degradation compared to the fixed compressing rate variants.
We also explore adaptive compressing rate learning, demonstrating the ability to select task-specific preferred frame periods without needing a grid search.
arXiv Detail & Related papers (2022-11-04T09:19:13Z) - Continuous Sign Language Recognition via Temporal Super-Resolution Network [10.920363368754721]
This paper addresses the large computational cost of deep-learning-based spatial-temporal hierarchical continuous sign language recognition models.
The data is reconstructed into a dense feature sequence to reduce the overall model computation while keeping the loss in final recognition accuracy to a minimum.
Experiments on two large-scale sign language datasets demonstrate the effectiveness of the proposed model.
arXiv Detail & Related papers (2022-07-03T00:55:45Z) - Context-aware and Scale-insensitive Temporal Repetition Counting [60.40438811580856]
Temporal repetition counting aims to estimate the number of cycles of a given repetitive action.
Existing deep learning methods assume repetitive actions are performed in a fixed time-scale, which is invalid for the complex repetitive actions in real life.
We propose a context-aware and scale-insensitive framework to tackle the challenges in repetition counting caused by the unknown and diverse cycle-lengths.
arXiv Detail & Related papers (2020-05-18T05:49:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.