Fugu-MT Paper Translation (Abstract): TEMPO: Temporal Enforcement via Mode-Separated Policy Optimization for Trustworthy LLM Backtesting

Paper overview: TEMPO: Temporal Enforcement via Mode-Separated Policy Optimization for Trustworthy LLM Backtesting

arxiv url: http://arxiv.org/abs/2605.18843v1
Date: Wed, 13 May 2026 05:01:37 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:08.660715
Title: TEMPO: Temporal Enforcement via Mode-Separated Policy Optimization for Trustworthy LLM Backtesting
Title (reference translation): TEMPO: Temporal Enforcement via Mode-Separated Policy Optimization for Trustworthy LLM Backtesting
Authors: Zeyu Zhang, Bradly C. Stadie
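Abstract summary: Backtesting large language models on historical events requires reasoning exclusively from information available before a specified cutoff date. Models routinely leak post-cutoff knowledge from pre-training into their reasoning, inflating apparent accuracy and undermining evaluation validity. This paper proposes TEMPO, which trains temporal discipline via two contributions: a two-mode reward in which a leakage mode drives post-cutoff claims to zero, and a GRPO-based training pipeline that enables the model to discover temporally valid reasoning strategies.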
Reference score (independently calculated attention score): 6.428238071336693
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Backtesting large language models on historical events requires reasoning exclusively from information available before a specified cutoff date. Yet models routinely leak post-cutoff knowledge from pre-training into their reasoning, inflating apparent accuracy and undermining evaluation validity. Prompt-based constraints fail when suppressed content is causally related to the prediction, and knowledge unlearning cannot address this problem because temporal compliance is instance-specific: the same fact may be legitimate evidence for one cutoff date and a violation for another. Rather than erasing knowledge, the model must learn temporal discipline: selecting evidence conditioned on each instance's cutoff date. We propose TEMPO (Temporal Enforcement via Mode-separated Policy Optimization), which trains this discipline via two contributions: (1) a two-mode reward where a leakage mode drives post-cutoff claims to zero as a hard prerequisite before a performance mode optimizes task performance; and (2) a GRPO-based training pipeline that enables the model to discover temporally valid reasoning strategies. We prove that training monotonically decreases leakage, converges to the leak-free optimum, and improves task performance once compliance is achieved. On three prediction tasks and two models, TEMPO reduces leakage from 2–13% to 0.6–3.7% across all conditions, with task performance improving 6–13% where strong pre-cutoff signals exist and maintained where the prediction task is inherently difficult from valid information alone.
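Abstract (reference translation): Backtesting large language models on historical events requires reasoning only from information that was available before a specified cutoff date. However, models routinely leak post-cutoff knowledge from pre-training into their reasoning, which inflates apparent accuracy and undermines the validity of the evaluation. Prompt-based constraints fail when the suppressed content is causally related to the prediction, and knowledge unlearning cannot address the problem because temporal compliance is instance-specific: the same fact may be legitimate evidence for one cutoff date and a violation for another. Rather than erasing knowledge, the model must learn temporal discipline, that is, selecting evidence conditioned on each instance's cutoff date. This paper proposes TEMPO, which trains this discipline through two contributions: (1) a two-mode reward in which a leakage mode drives post-cutoff claims to zero as a hard prerequisite before a performance mode optimizes task performance, and (2) a GRPO-based training pipeline that enables the model to discover temporally valid reasoning strategies. The authors prove that training monotonically decreases leakage, converges to the leak-free optimum, and improves task performance once compliance is achieved. On three prediction tasks and two models, TEMPO reduces leakage from 2–13% to 0.6–3.7% across all conditions; task performance improves by 6–13% where strong pre-cutoff signals exist and is maintained where the prediction task is inherently difficult from valid information alone.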
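
The two-mode reward and the group-relative step of GRPO can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual reward formula, so everything here is an assumption made for illustration: the `Claim` type, the `known_since` field, the choice to penalize by the negative leak count, the convention that a claim dated exactly on the cutoff is allowed, and a task score in [0, 1]. The only properties taken from the abstract are that leakage is judged per instance against that instance's cutoff date, and that any leaking response must rank below every compliant one before task performance matters.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean, pstdev


@dataclass
class Claim:
    text: str
    known_since: date  # hypothetical field: when the fact became available


def count_leaks(claims, cutoff):
    """Count claims whose information only became available after the cutoff.

    Assumption: a claim dated exactly on the cutoff is treated as valid.
    """
    return sum(1 for c in claims if c.known_since > cutoff)


def two_mode_reward(claims, cutoff, task_score):
    """Illustrative two-mode reward (not the paper's formula).

    Leakage mode: any post-cutoff claim yields a negative reward.
    Performance mode: a leak-free response earns its task score,
    assumed to lie in [0, 1], so it always outranks a leaking one.
    """
    leaks = count_leaks(claims, cutoff)
    if leaks > 0:
        return -float(leaks)
    return task_score


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# The same fact is a violation for one cutoff and valid evidence for another.
claim = Claim("Company X announced a merger", date(2024, 3, 1))
print(two_mode_reward([claim], date(2024, 1, 1), 0.9))  # prints -1.0
print(two_mode_reward([claim], date(2024, 6, 1), 0.9))  # prints 0.9
```

In this sketch the leaking response receives a below-average advantage within its group whenever at least one compliant response was sampled, which is one simple way to make compliance a prerequisite for task reward.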
