TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks
- URL: http://arxiv.org/abs/2602.13272v1
- Date: Thu, 05 Feb 2026 01:02:19 GMT
- Title: TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks
- Authors: Muyan Weng, Defu Cao, Wei Yang, Yashaswi Sharma, Yan Liu,
- Abstract summary: It is unclear whether strong forecasting performance reflects genuine temporal understanding or the ability to reason under contextual and event-driven conditions.<n>We introduce TemporalBench, a multi-domain benchmark designed to evaluate temporal reasoning behavior under progressively richer informational settings.<n>By controlling access to future targets and contextual information, the benchmark enables a diagnostic analysis of whether models can correctly interpret temporal patterns.
- Score: 12.114998959919978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It is unclear whether strong forecasting performance reflects genuine temporal understanding or the ability to reason under contextual and event-driven conditions. We introduce TemporalBench, a multi-domain benchmark designed to evaluate temporal reasoning behavior under progressively richer informational settings. TemporalBench adopts a four-tier task taxonomy that examines historical structure interpretation, context-free forecasting, contextual temporal reasoning, and event-conditioned prediction across four real-world domains: retail, healthcare, energy, and physical systems. By controlling access to future targets and contextual information, the benchmark enables a diagnostic analysis of whether models can correctly interpret temporal patterns, align them with external context, and adapt predictions when conditions change. Extensive baseline experiments show that strong numerical forecasting accuracy does not reliably translate into robust contextual or event-aware temporal reasoning; instead, existing agent frameworks exhibit fragmented strengths and systematic failure modes that remain largely hidden under forecasting-only benchmarks. The TemporalBench dataset is publicly available at https://huggingface.co/datasets/Melady/TemporalBench, and we additionally provide a public leaderboard at https://huggingface.co/spaces/Melady/TemporalBench_Leaderboard.
Related papers
- It's TIME: Towards the Next Generation of Time Series Forecasting Benchmarks [87.7937890373758]
Time series foundation models (TSFMs) are revolutionizing the forecasting landscape from specific dataset modeling to generalizable task evaluation.<n>We introduce TIME, a next-generation task-centric benchmark comprising 50 fresh datasets and 98 forecasting tasks.<n>We propose a novel pattern-level evaluation perspective that moves beyond traditional dataset-level evaluations based on static meta labels.
arXiv Detail & Related papers (2026-02-12T16:31:01Z) - From Observations to States: Latent Time Series Forecasting [65.98504021691666]
We propose Latent Time Series Forecasting (LatentTSF), a novel paradigm that shifts TSF from observation regression to latent state prediction.<n>Specifically, LatentTSF employs an AutoEncoder to project observations at each time step into a higher-dimensional latent state space.<n>Our proposed latent objectives implicitly maximize mutual information between predicted latent states and ground-truth states and observations.
arXiv Detail & Related papers (2026-01-30T20:39:44Z) - What If TSF: A Benchmark for Reframing Forecasting as Scenario-Guided Multimodal Forecasting [8.593646221015264]
What If TSF (WIT) is a benchmark designed to evaluate whether models can condition their forecasts on contextual text.<n>WIT offers a rigorous testbed for scenario-guided multimodal forecasting.
arXiv Detail & Related papers (2026-01-13T12:47:43Z) - Hierarchical AI-Meteorologist: LLM-Agent System for Multi-Scale and Explainable Weather Forecast Reporting [3.0354231393746685]
Hierarchical AI-Meteorologist generates explainable weather reports using a hierarchical forecast reasoning and weather keyword generation.<n>Our framework performs multi-scale reasoning across hourly, 6-hour, and daily aggregations to capture both short-term dynamics and long-term trends.
arXiv Detail & Related papers (2025-11-28T17:27:06Z) - Zephyrus: An Agentic Framework for Weather Science [47.611521052984365]
Foundation models for weather science are pre-trained on vast amounts of structured numerical data and outperform traditional weather forecasting systems.<n>Large language models (LLMs) excel at understanding and generating text but cannot reason about high-dimensional meteorological datasets.<n>We bridge this gap by building a novel agentic framework for weather science.<n>We design Zephyrus, a multi-turn LLM-based weather agent that iteratively analyzes weather datasets, observes results, and refines its approach through conversational feedback loops.
arXiv Detail & Related papers (2025-10-05T03:34:08Z) - When Does Multimodality Lead to Better Time Series Forecasting? [96.26052272121615]
We investigate whether and under what conditions such multimodal integration consistently yields gains.<n>Our findings reveal that the benefits of multimodality are highly condition-dependent.<n>Our study offers a rigorous, quantitative foundation for understanding when multimodality can be expected to aid forecasting tasks.
arXiv Detail & Related papers (2025-06-20T23:55:56Z) - Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.<n>We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.<n>We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z) - Temporal Validity Change Prediction [20.108317515225504]
Existing benchmarking tasks require models to identify the temporal validity duration of a single statement.
In many cases, additional contextual information, such as sentences in a story or posts on a social media profile, can be collected from the available text stream.
We propose Temporal Validity Change Prediction, a natural language processing task benchmarking the capability of machine learning models to detect contextual statements that induce such change.
arXiv Detail & Related papers (2024-01-01T14:58:53Z) - Streaming Motion Forecasting for Autonomous Driving [71.7468645504988]
We introduce a benchmark that queries future trajectories on streaming data and we refer to it as "streaming forecasting"
Our benchmark inherently captures the disappearance and re-appearance of agents, which is a safety-critical problem yet overlooked by snapshot-based benchmarks.
We propose a plug-and-play meta-algorithm called "Predictive Streamer" that can adapt any snapshot-based forecaster into a streaming forecaster.
arXiv Detail & Related papers (2023-10-02T17:13:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.