Fugu-MT Paper Translation (Abstract): LLM-as-a-Judge for Time Series Explanations

Paper summary: LLM-as-a-Judge for Time Series Explanations

arxiv url: http://arxiv.org/abs/2604.02118v1
Date: Thu, 02 Apr 2026 14:55:25 GMT
Status: Translation complete
System update date: 2026-04-03 14:21:10.866013
Title: LLM-as-a-Judge for Time Series Explanations
Title (reference translation): LLM-as-a-Judge for Time Series Explanations
Authors: Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar
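Abstract summary: This work studies large language models as both generators and evaluators of time series explanations in a reference-free setting. We construct a benchmark of 350 time series cases across seven query types, each paired with correct, partially correct, and incorrect explanations. We evaluate models across four tasks: explanation generation, relative ranking, independent scoring, and multi-anomaly detection.
Reference score (independently computed attention score): 7.771378647684901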
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Evaluating the factual correctness of LLM-generated natural language explanations grounded in time series data remains an open challenge. Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference-based similarity metrics and consistency-checking models require ground-truth explanations, while traditional time series methods operate purely on numerical values and cannot assess free-form textual reasoning. Thus, no general-purpose method exists to directly verify whether an explanation is faithful to the underlying time series data without predefined references or task-specific rules. We study large language models as both generators and evaluators of time series explanations in a reference-free setting, where, given a time series, a question, and a candidate explanation, the evaluator assigns a ternary correctness label based on pattern identification, numeric accuracy, and answer faithfulness, enabling principled scoring and comparison. To support this, we construct a synthetic benchmark of 350 time series cases across seven query types, each paired with correct, partially correct, and incorrect explanations. We evaluate models across four tasks: explanation generation, relative ranking, independent scoring, and multi-anomaly detection. Results show a clear asymmetry: generation is highly pattern-dependent and exhibits systematic failures on certain query types, with accuracies ranging from 0.00 to 0.12 for Seasonal Drop and Volatility Shift up to 0.94 to 0.96 for Structural Break, while evaluation is more stable, with models correctly ranking and scoring explanations even when their own outputs are incorrect. These findings demonstrate the feasibility of data-grounded, LLM-based evaluation for time series explanations and highlight the potential of LLMs as reliable evaluators of data-grounded reasoning in the time series domain.
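Abstract (reference translation): Evaluating the factual correctness of LLM-generated natural language explanations grounded in time series data remains an open challenge. Reference-based similarity metrics and consistency-checking models require ground-truth explanations, while traditional time series methods operate purely on numerical values and cannot assess free-form textual reasoning. Thus, no general-purpose method exists to directly verify whether an explanation is faithful to the underlying time series data without predefined references or task-specific rules. We study large language models as both generators and evaluators of time series explanations in a reference-free setting: given a time series, a question, and a candidate explanation, the evaluator assigns a ternary correctness label based on pattern identification, numeric accuracy, and answer faithfulness, enabling principled scoring and comparison. To support this, we construct a synthetic benchmark of 350 time series cases across seven query types. We evaluate models across four tasks: explanation generation, relative ranking, independent scoring, and multi-anomaly detection. Generation is highly pattern-dependent and exhibits systematic failures on certain query types, with accuracies of 0.00 to 0.12 for Seasonal Drop and Volatility Shift and 0.94 to 0.96 for Structural Break, while evaluation is more stable: models correctly rank and score explanations even when their own outputs are incorrect. These results demonstrate the feasibility of data-grounded, LLM-based evaluation for time series explanations and highlight the potential of LLMs as reliable evaluators of data-grounded reasoning in the time series domain.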
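
The ternary labeling and the per-query-type accuracies mentioned in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration only: the abstract does not say how the three criteria (pattern identification, numeric accuracy, answer faithfulness) are combined into a label, so `combine_criteria` uses a hypothetical rule, and `Label`, `Judgment`, and `accuracy_by_query_type` are hypothetical names rather than the authors' code.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    """Ternary correctness label, as described in the abstract."""
    INCORRECT = 0
    PARTIALLY_CORRECT = 1
    CORRECT = 2


@dataclass
class Judgment:
    query_type: str   # e.g. "Structural Break"
    predicted: Label  # label assigned by the evaluator
    gold: Label       # label attached to the benchmark explanation


def combine_criteria(pattern_ok: bool, numbers_ok: bool, answer_ok: bool) -> Label:
    """Hypothetical aggregation rule (not taken from the paper):
    all three criteria met -> CORRECT, none met -> INCORRECT,
    anything in between -> PARTIALLY_CORRECT."""
    met = sum([pattern_ok, numbers_ok, answer_ok])
    if met == 3:
        return Label.CORRECT
    if met == 0:
        return Label.INCORRECT
    return Label.PARTIALLY_CORRECT


def accuracy_by_query_type(judgments):
    """Fraction of judgments whose predicted label matches the gold label,
    grouped by query type."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for j in judgments:
        totals[j.query_type] += 1
        hits[j.query_type] += int(j.predicted == j.gold)
    return {q: hits[q] / totals[q] for q in totals}
```

Grouping accuracy by query type is what makes a gap such as the one reported between Seasonal Drop and Structural Break visible; a single pooled accuracy would hide it.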

Related papers