TS-Agent: A Time Series Reasoning Agent with Iterative Statistical Insight Gathering
- URL: http://arxiv.org/abs/2510.07432v1
- Date: Wed, 08 Oct 2025 18:31:53 GMT
- Title: TS-Agent: A Time Series Reasoning Agent with Iterative Statistical Insight Gathering
- Authors: Penghang Liu, Elizabeth Fons, Svitlana Vyetrenko, Daniel Borrajo, Vamsi Potluru, Manuela Veloso
- Abstract summary: We propose TS-Agent, a time series reasoning agent that leverages large language models (LLMs). Instead of mapping time series into text tokens, images, or embeddings, our agent interacts with raw numeric sequences through atomic operators. Our experiments show that TS-Agent achieves performance comparable to state-of-the-art LLMs on understanding benchmarks.
- Score: 16.95452463476229
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have shown strong abilities in reasoning and problem solving, but recent studies reveal that they still struggle with time series reasoning tasks, where outputs are often affected by hallucination or knowledge leakage. In this work we propose TS-Agent, a time series reasoning agent that leverages LLMs strictly for what they excel at, i.e., gathering evidence and synthesizing it into conclusions through step-by-step reasoning, while delegating the extraction of statistical and structural information to time series analytical tools. Instead of mapping time series into text tokens, images, or embeddings, our agent interacts with raw numeric sequences through atomic operators, records outputs in an explicit evidence log, and iteratively refines its reasoning under the guidance of a self-critic and a final quality gate. This design avoids multi-modal alignment training, preserves the native form of time series, ensures interpretability and verifiability, and mitigates knowledge leakage or hallucination. Empirically, we evaluate the agent on established benchmarks. Our experiments show that TS-Agent achieves performance comparable to state-of-the-art LLMs on understanding benchmarks, and delivers significant improvements on reasoning tasks, where existing models often rely on memorization and fail in zero-shot settings.
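The loop described in the abstract (atomic operators on the raw series, an explicit evidence log, a self-critic, and a final quality gate) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the operator set, the `Evidence` record, and the rule-based `critic` and `quality_gate` are stand-ins for components that TS-Agent would drive with an LLM.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable, Dict, List, Sequence, Tuple


# Hypothetical atomic operators: each reads the raw numeric series
# and returns exactly one statistic.
def op_mean(xs: Sequence[float]) -> float:
    return mean(xs)


def op_std(xs: Sequence[float]) -> float:
    return pstdev(xs)


def op_slope(xs: Sequence[float]) -> float:
    # Least-squares slope of the values against their index
    # (assumes at least two points).
    n = len(xs)
    t_bar = (n - 1) / 2
    x_bar = mean(xs)
    num = sum((t - t_bar) * (x - x_bar) for t, x in enumerate(xs))
    den = sum((t - t_bar) ** 2 for t in range(n))
    return num / den


OPERATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "mean": op_mean,
    "std": op_std,
    "slope": op_slope,
}
# Assumption: the evidence this toy task needs before it may conclude.
REQUIRED = ("mean", "std", "slope")


@dataclass
class Evidence:
    operator: str
    value: float


@dataclass
class AgentState:
    series: List[float]
    log: List[Evidence] = field(default_factory=list)  # explicit evidence log


def critic(state: AgentState) -> List[str]:
    """Stand-in self-critic: list required operators with no logged evidence."""
    seen = {e.operator for e in state.log}
    return [name for name in REQUIRED if name not in seen]


def quality_gate(state: AgentState) -> bool:
    """Stand-in final gate: pass only when no required evidence is missing."""
    return not critic(state)


def run_agent(series: Sequence[float], max_steps: int = 5) -> Tuple[str, List[Evidence]]:
    state = AgentState(series=list(series))
    for _ in range(max_steps):
        missing = critic(state)
        if not missing:
            break
        name = missing[0]  # a real agent would let the LLM choose the next operator
        state.log.append(Evidence(name, OPERATORS[name](state.series)))
    if not quality_gate(state):
        raise RuntimeError("evidence incomplete; refusing to answer")
    facts = {e.operator: e.value for e in state.log}
    trend = "increasing" if facts["slope"] > 0 else "non-increasing"
    return trend, state.log
```

Calling `run_agent([1, 2, 3, 4, 5])` gathers the three statistics in order and concludes from the logged slope only, so every claim in the answer can be traced back to a recorded operator output.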
Related papers
- Agentic Spatio-Temporal Grounding via Collaborative Reasoning [80.83158605034465]
Spatio-Temporal Video Grounding (STVG) aims to retrieve the spatio-temporal tube of a target object or person in a video given a text query. We propose the Agentic Spatio-Temporal Grounder (ASTG) framework for STVG in an open-world, training-free scenario. Specifically, two specialized agents, SRA (Spatial Reasoning Agent) and TRA (Temporal Reasoning Agent), are built on modern Multimodal Large Language Models (MLLMs). Experiments on popular benchmarks show that the proposed approach outperforms existing weakly-supervised and zero-shot approaches by a margin.
arXiv Detail & Related papers (2026-02-10T10:16:27Z)
- TSAQA: Time Series Analysis Question And Answering Benchmark [85.35545785252309]
Time series data are integral to critical applications across domains such as finance, healthcare, transportation, and environmental science. We introduce TSAQA, a novel unified benchmark designed to broaden task coverage and evaluate diverse temporal analysis capabilities.
arXiv Detail & Related papers (2026-01-30T17:28:56Z)
- Lost in the Noise: How Reasoning Models Fail with Contextual Distractors [57.31788955167306]
Recent advances in reasoning models and agentic AI systems have led to an increased reliance on diverse external information. We introduce NoisyBench, a comprehensive benchmark that systematically evaluates model robustness across 11 datasets in RAG, reasoning, alignment, and tool-use tasks. Our evaluation reveals a catastrophic performance drop of up to 80% in state-of-the-art models when faced with contextual distractors.
arXiv Detail & Related papers (2026-01-12T05:43:51Z)
- NUM2EVENT: Interpretable Event Reasoning from Numerical time-series [6.45945124018154]
We introduce the task of number-to-event reasoning and decoding, which aims to infer interpretable structured events from numerical inputs. To address the data scarcity and semantic alignment challenges, we propose a reasoning-aware framework. Our model explicitly reasons over numerical changes, generates intermediate explanations, and outputs structured event hypotheses.
arXiv Detail & Related papers (2025-10-24T02:57:11Z)
- Eliciting Chain-of-Thought Reasoning for Time Series Analysis using Reinforcement Learning [2.426309874608745]
Complex numerical time series analysis often demands multi-step reasoning capabilities beyond current models' reach. We introduce Chain Of thought for Understanding Numerical Time Series (COUNTS), the first framework that trains large language models to perform Chain-of-Thought (CoT) reasoning across diverse time series tasks using reinforcement learning (RL) with verifiable rewards. Our experiments demonstrate that this RL-driven approach with intermediate CoT reasoning significantly enhances LLM performance across various time series analysis tasks, opening new possibilities for complex temporal data reasoning.
arXiv Detail & Related papers (2025-10-01T17:02:28Z)
- AXIS: Explainable Time Series Anomaly Detection with Large Language Models [33.68487894996624]
AXIS is a framework that conditions a frozen Large Language Model (LLM) for nuanced time-series understanding. LLMs operate on discrete tokens and struggle to directly process long, continuous signals. We introduce a new benchmark featuring multi-format questions and rationales that supervise contextual grounding and pattern-level semantics.
arXiv Detail & Related papers (2025-09-29T07:24:22Z)
- GSM-Agent: Understanding Agentic Reasoning Using Controllable Environments [56.007498767771075]
GSM-Agent is a novel benchmark for evaluating agentic reasoning in complex environments. We analyze the agentic reasoning patterns by clustering the environment's document embeddings into nodes and mapping each tool call to its nearest node. We propose a tool-augmented test-time scaling method to improve LLMs' agentic reasoning performance by adding tools to encourage models to revisit.
arXiv Detail & Related papers (2025-09-26T07:24:37Z)
- Enhancing LLM Reasoning for Time Series Classification by Tailored Thinking and Fused Decision [8.256998757769322]
ReasonTSC is a framework designed to leverage LLM reasoning for time series classification. It steers the model to think over the essential characteristics of time series data. It integrates predictions and confidence scores from plug-in classifiers, e.g., domain-specific time series models, as in-context examples.
arXiv Detail & Related papers (2025-06-01T03:15:54Z)
- IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
- Learning to Reason Over Time: Timeline Self-Reflection for Improved Temporal Reasoning in Language Models [21.579319926212296]
Large Language Models (LLMs) have emerged as powerful tools for generating coherent text, understanding context, and performing reasoning tasks. They struggle with temporal reasoning, which requires processing time-related information such as event sequencing, durations, and inter-temporal relationships. We introduce TISER, a novel framework that enhances the temporal reasoning abilities of LLMs through a multi-stage process that combines timeline construction with iterative self-reflection.
arXiv Detail & Related papers (2025-04-07T16:51:45Z)
- Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context. We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters. We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
- FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs. FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation. We introduce Truth-Triangulator that synthesizes reflective considerations by tool-enhanced ChatGPT and LoRA-tuning based on Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.