Fugu-MT Paper Translation (Abstract): TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

Paper summary: TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

arxiv url: http://arxiv.org/abs/2606.01498v1
Date: Sun, 31 May 2026 23:34:35 GMT
Status: Translation complete
System update date: 2026-06-02 21:34:29.736976
Title: TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning
Authors: Yaxuan Kong, Qingren Yao, Yuqi Nie, Yichen Li, Yilei Shao, Stefan Zohren, Anna Vettoruzzo, Joaquin Vanschoren, Ming Jin, Qingsong Wen
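Abstract summary: Time series data inform critical decisions across many real-world domains. It remains unclear whether large language model (LLM) agents can conduct reliable time series analysis across multi-turn conversations. TimeSage-MT is a benchmark for agentic time series reasoning with 240 tasks and 2,680 dialogue turns across 8 real-world domains.
Reference score (independently calculated attention score): 44.68126840122709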
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Time series data inform critical decisions across many real-world domains. While large language model (LLM) agents can analyze data through natural language and tools, it remains unclear whether they can conduct reliable time series analysis across multi-turn conversations. Existing benchmarks focus on single-step tasks such as forecasting and anomaly detection, overlooking practical workflows where user goals evolve, agents must build on prior analyses, and conclusions emerge from accumulated evidence. In this work, we introduce TimeSage-MT, a multi-turn benchmark for agentic time series reasoning with 240 tasks and 2,680 dialogue turns across 8 real-world domains, spanning basic exploration to decision-oriented analysis. TimeSage-MT is built through a reproducible pipeline that converts real-world time series data into multi-turn conversations with verifiable answers. It provides a unified evaluation protocol and public leaderboard for comparing time series agentic systems. To demonstrate the benchmark's utility, we evaluate frontier LLMs alongside TimeSage, a novel structured agent equipped with a comprehensive time series skill library. The results show sharp performance drops on decision-oriented tasks, driven by failures in memory, uncertainty handling, and domain-based decision making. TimeSage-MT exposes critical gaps in current agentic reasoning and provides a rigorous foundation for future development.
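The multi-turn setup described in the abstract, where each task is a conversation whose turns have verifiable answers, can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Turn` and `Task` types, the `exact_match` scorer, the agent interface, and the toy task are illustrative assumptions, not the TimeSage-MT evaluation protocol or API. The only figures taken from the abstract are the 240 tasks and 2,680 dialogue turns, which imply roughly 11.2 turns per task on average.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Figures stated in the abstract.
NUM_TASKS = 240
NUM_TURNS = 2680


@dataclass
class Turn:
    question: str
    expected: str  # verifiable reference answer (hypothetical format)


@dataclass
class Task:
    domain: str
    turns: List[Turn]


# Hypothetical agent interface: takes the conversation history
# (question, answer pairs) and the new question, returns an answer.
Agent = Callable[[List[Tuple[str, str]], str], str]


def exact_match(predicted: str, expected: str) -> bool:
    """Assumed scorer: case-insensitive exact match after trimming whitespace."""
    return predicted.strip().lower() == expected.strip().lower()


def evaluate_task(agent: Agent, task: Task) -> float:
    """Run one multi-turn conversation and return the fraction of correct turns."""
    history: List[Tuple[str, str]] = []
    correct = 0
    for turn in task.turns:
        answer = agent(history, turn.question)
        correct += exact_match(answer, turn.expected)
        # Keep the exchange so later turns can build on earlier ones.
        history.append((turn.question, answer))
    return correct / len(task.turns) if task.turns else 0.0


def average_turns_per_task() -> float:
    """Average dialogue length implied by the abstract's totals."""
    return NUM_TURNS / NUM_TASKS


def toy_agent(history: List[Tuple[str, str]], question: str) -> str:
    # Toy behavior: answers "upward" first, then repeats its previous answer.
    return "upward" if not history else history[-1][1]


# Made-up task for illustration only; not drawn from the benchmark.
toy_task = Task(
    domain="example-domain",
    turns=[
        Turn("What is the overall trend?", "upward"),
        Turn("Is there weekly seasonality?", "yes"),
    ],
)

if __name__ == "__main__":
    print(f"toy accuracy: {evaluate_task(toy_agent, toy_task):.2f}")
    print(f"average turns per task: {average_turns_per_task():.2f}")
```

The toy agent ignores the second question and repeats its earlier answer, so it gets only the first turn right. A real scorer would likely need more than exact string matching, for example numeric tolerance or structured answers.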
