When LLM Meets Time Series: Can LLMs Perform Multi-Step Time Series Reasoning and Inference
- URL: http://arxiv.org/abs/2509.01822v1
- Date: Mon, 01 Sep 2025 22:58:57 GMT
- Title: When LLM Meets Time Series: Can LLMs Perform Multi-Step Time Series Reasoning and Inference
- Authors: Wen Ye, Jinbo Liu, Defu Cao, Wei Yang, Yan Liu
- Abstract summary: We introduce the TSAIA Benchmark, a first attempt to evaluate Large Language Models as time-series AI assistants. The benchmark encompasses a broad spectrum of challenges, ranging from constraint-aware forecasting to anomaly detection with threshold calibration. We apply this benchmark to assess eight state-of-the-art LLMs under a unified evaluation protocol.
- Score: 12.867006554196358
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of Large Language Models (LLMs) has sparked growing interest in their application to time series analysis tasks. However, their ability to perform complex reasoning over temporal data in real-world application domains remains underexplored. To move toward this goal, a first step is to establish a rigorous benchmark dataset for evaluation. In this work, we introduce the TSAIA Benchmark, a first attempt to evaluate LLMs as time-series AI assistants. To ensure both scientific rigor and practical relevance, we surveyed over 20 academic publications and identified 33 real-world task formulations. The benchmark encompasses a broad spectrum of challenges, ranging from constraint-aware forecasting to anomaly detection with threshold calibration: tasks that require compositional reasoning and multi-step time series analysis. The question generator is designed to be dynamic and extensible, supporting continuous expansion as new datasets or task types are introduced. Given the heterogeneous nature of the tasks, we adopt task-specific success criteria and tailored inference-quality metrics to ensure meaningful evaluation for each task. We apply this benchmark to assess eight state-of-the-art LLMs under a unified evaluation protocol. Our analysis reveals limitations in current models' ability to assemble complex time series analysis workflows, underscoring the need for specialized methodologies for domain-specific adaptation. Our benchmark is available at https://huggingface.co/datasets/Melady/TSAIA, and the code is available at https://github.com/USC-Melady/TSAIA.
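Since the benchmark is hosted on the Hugging Face Hub at the repository named in the abstract, a minimal sketch of how one might load and inspect it is shown below; the split name and record layout are assumptions for illustration, as the actual dataset configuration is not described in the abstract.

```python
# Minimal sketch: loading the TSAIA benchmark from the Hugging Face Hub.
# The repository path "Melady/TSAIA" comes from the abstract; the split name
# and field names below are assumptions, not the dataset's documented schema.
from datasets import load_dataset

# Assumed: the benchmark exposes at least one split of question records.
tsaia = load_dataset("Melady/TSAIA", split="train")

# Inspect one record to discover the actual field names before building a pipeline.
example = tsaia[0]
print(example.keys())
```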
Related papers
- GISA: A Benchmark for General Information-Seeking Assistant [102.30831921333755]
GISA is a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves only 19.30% exact match score.
arXiv Detail & Related papers (2026-02-09T11:44:15Z) - TSAQA: Time Series Analysis Question And Answering Benchmark [85.35545785252309]
Time series data are integral to critical applications across domains such as finance, healthcare, transportation, and environmental science. We introduce TSAQA, a novel unified benchmark designed to broaden task coverage and evaluate diverse temporal analysis capabilities.
arXiv Detail & Related papers (2026-01-30T17:28:56Z) - Time-RA: Towards Time Series Reasoning for Anomaly with LLM Feedback [55.284574165467525]
Time-series Reasoning for Anomaly (Time-RA) transforms classical time series anomaly detection into a generative, reasoning-intensive task. We also introduce the first real-world multimodal benchmark dataset, RATs40K, explicitly annotated for anomaly reasoning.
arXiv Detail & Related papers (2025-07-20T18:02:50Z) - TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents [17.296425855109426]
We introduce TimeSeriesGym, a scalable benchmarking framework for evaluating Artificial Intelligence (AI) agents. TimeSeriesGym incorporates challenges from diverse sources spanning multiple domains and tasks. We implement evaluation mechanisms for multiple research artifacts, including submission files, code, and models.
arXiv Detail & Related papers (2025-05-19T16:11:23Z) - Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.90769864167301]
Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers. Our contributions focus on three key challenges encountered in real-world use: (i) user prompts are often under-specified; (ii) retrieved candidate papers frequently contain irrelevant content; and (iii) task evaluation should move beyond shallow text similarity techniques.
arXiv Detail & Related papers (2025-04-14T14:52:28Z) - On the Temporal Question-Answering Capabilities of Large Language Models Over Anonymized Data [1.2979906794584584]
The applicability of Large Language Models (LLMs) to temporal reasoning tasks over data not seen during training remains largely unexplored. In this paper we address this topic, focusing on structured and semi-structured anonymized data. We identify and examine seventeen common temporal reasoning tasks in natural language, focusing on their algorithmic components.
arXiv Detail & Related papers (2025-04-10T10:48:42Z) - Are Large Language Models Useful for Time Series Data Analysis? [3.44393516559102]
Time series data plays a critical role across diverse domains such as healthcare, energy, and finance. This study investigates whether large language models (LLMs) are effective for time series data analysis.
arXiv Detail & Related papers (2024-12-16T02:47:44Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z) - DCA-Bench: A Benchmark for Dataset Curation Agents [9.60250892491588]
Data quality issues, such as incomplete documentation, inaccurate labels, ethical concerns, and outdated information, remain common in widely used datasets. With the growing capabilities of large language models (LLMs), LLM agents are a promising way to streamline the discovery of hidden dataset issues. In this work, we establish a benchmark to measure an LLM agent's ability to tackle this challenge.
arXiv Detail & Related papers (2024-06-11T14:02:23Z) - Empowering Time Series Analysis with Large Language Models: A Survey [24.202539098675953]
We provide a systematic overview of methods that leverage large language models for time series analysis.
Specifically, we first state the challenges and motivations of applying language models in the context of time series.
Next, we categorize existing methods into different groups (i.e., direct query, tokenization, prompt design, fine-tuning, and model integration) and highlight the key ideas within each group; an illustrative sketch of the direct-query pattern appears after this entry.
arXiv Detail & Related papers (2024-02-05T16:46:35Z)
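As a concrete illustration of the "direct query" category named in the survey above, the sketch below serializes a raw numeric series into a text prompt and asks an LLM a question about it. The `serialize_series`, `query_llm`, and `ask_about_series` helpers are hypothetical stand-ins for illustration only; they are not APIs defined by the survey, the TSAIA benchmark, or any of the papers listed here.

```python
# Illustrative sketch of the "direct query" pattern: a raw time series is
# rendered as text and the LLM is prompted directly, with no tokenizer changes
# or fine-tuning. `query_llm` is a hypothetical placeholder for whatever
# chat-completion client is actually used.
from typing import Sequence


def serialize_series(values: Sequence[float], precision: int = 2) -> str:
    """Render a numeric series as a comma-separated string for prompting."""
    return ", ".join(f"{v:.{precision}f}" for v in values)


def query_llm(prompt: str) -> str:
    """Hypothetical placeholder for an LLM call; wire this to a real client."""
    raise NotImplementedError("Replace with your LLM provider of choice.")


def ask_about_series(values: Sequence[float], question: str) -> str:
    """Build a direct-query prompt over the serialized series and send it."""
    prompt = (
        "You are a time series analysis assistant.\n"
        f"Series: {serialize_series(values)}\n"
        f"Question: {question}\n"
        "Answer concisely."
    )
    return query_llm(prompt)
```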