TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models
- URL: http://arxiv.org/abs/2601.18744v1
- Date: Mon, 26 Jan 2026 18:04:54 GMT
- Title: TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models
- Authors: Fangxu Yu, Xingang Guo, Lingzhi Yuan, Haoqiang Kang, Hongyu Zhao, Lianhui Qin, Furong Huang, Bin Hu, Tianyi Zhou,
- Abstract summary: The ability to reason over time series is a fundamental skill for generalist models to solve practical problems.<n>To bridge this gap, we introduce TSRBench, a comprehensive benchmark designed to stress-test the full spectrum of time series reasoning capabilities.
- Score: 52.21298691935726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Time series data is ubiquitous in real-world scenarios and crucial for critical applications ranging from energy management to traffic control. Consequently, the ability to reason over time series is a fundamental skill for generalist models to solve practical problems. However, this dimension is notably absent from existing benchmarks of generalist models. To bridge this gap, we introduce TSRBench, a comprehensive multi-modal benchmark designed to stress-test the full spectrum of time series reasoning capabilities. TSRBench features: i) a diverse set of 4125 problems from 14 domains, and is categorized into 4 major dimensions: Perception, Reasoning, Prediction, and Decision-Making. ii) 15 tasks from the 4 dimensions evaluating essential reasoning capabilities (e.g., numerical reasoning). Through extensive experiments, we evaluated over 30 leading proprietary and open-source LLMs, VLMs, and TSLLMs within TSRBench. Our findings reveal that: i) scaling laws hold for perception and reasoning but break down for prediction; ii) strong reasoning does not guarantee accurate context-aware forecasting, indicating a decoupling between semantic understanding and numerical prediction; and iii) despite the complementary nature of textual and visual represenations of time series as inputs, current multimodal models fail to effectively fuse them for reciprocal performance gains. TSRBench provides a standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance generalist models. Our code and dataset are available at https://tsrbench.github.io/.
Related papers
- On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks [56.98385132295952]
We evaluate how well chain-of-thought approaches generalize on a simple planning task.<n>We find that reasoning traces which combine multiple text formats yield the best (and non-trivial) OOD generalization.<n> purely text-based models consistently outperform those utilizing image-based inputs.
arXiv Detail & Related papers (2026-02-17T09:51:40Z) - TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models [105.47481207029047]
We introduce Time Series Reasoning Suite (TSR-Suite), which formalizes four atomic tasks that span three fundamental capabilities for reasoning with time series.<n>We also introduce Time Omni-1, the first unified reasoning model designed to address diverse real-world problems demanding time series reasoning.
arXiv Detail & Related papers (2025-09-29T13:54:34Z) - seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs [1.0519693622157462]
We introduce seqBench, a benchmark for probing sequential reasoning limits in Large Language Models (LLMs)<n>We find that even top-performing models systematically fail on seqBench's structured reasoning tasks despite minimal search complexity.
arXiv Detail & Related papers (2025-09-21T01:32:13Z) - Time-RA: Towards Time Series Reasoning for Anomaly with LLM Feedback [55.284574165467525]
Time-series Reasoning for Anomaly (Time-RA) transforms classical time series anomaly detection into a generative, reasoning-intensive task.<n>Also, we introduce the first real-world multimodal benchmark dataset, RATs40K, explicitly annotated for anomaly reasoning.
arXiv Detail & Related papers (2025-07-20T18:02:50Z) - TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models [5.6525926183880255]
We introduce TurnBench, a novel benchmark that evaluates multi-turn, multi-step reasoning through an interactive code-breaking task.<n>In each episode, a model must uncover hidden logical or arithmetic rules by making sequential guesses, receiving structured feedback, and integrating clues across multiple rounds.<n>TurnBench includes two modes: Classic, which tests standard reasoning, and Nightmare, which introduces increased complexity and requires robust inferential chains.
arXiv Detail & Related papers (2025-06-02T05:47:50Z) - MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation [56.87891213797931]
We present MTR-Bench for Large Language Models' Multi-Turn Reasoning evaluation.<n>Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities.<n>MTR-Bench features fully-automated framework spanning both dataset constructions and model evaluations.
arXiv Detail & Related papers (2025-05-21T17:59:12Z) - Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.<n>We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.<n>We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z) - Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis [19.649769354503658]
We introduce TS-Reasoner, a Domain-Oriented Time Series Agent that integrates natural language reasoning with precise numerical execution.<n>We evaluate its capabilities through two axes: basic time series understanding and complex multi-step inference.
arXiv Detail & Related papers (2024-10-05T06:04:19Z) - MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large
Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.