Fugu-MT Paper Translation (Abstract): Do Large Language Models (LLMs) Understand Chronology?

Paper overview: Do Large Language Models (LLMs) Understand Chronology?

arxiv url: http://arxiv.org/abs/2511.14214v1
Date: Tue, 18 Nov 2025 07:45:12 GMT
Status: Translation complete
System update date: 2025-11-19 16:23:52.991833
Title: Do Large Language Models (LLMs) Understand Chronology?
Title (reference translation): Do large language models (LLMs) follow a chronology?
Authors: Pattaraphon Kenny Wongchamcharoen, Paul Glasserman
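Abstract summary: Large language models (LLMs) are increasingly used in finance and economics, where prompt-based attempts to guard against look-ahead bias implicitly assume that models understand chronology. We test this fundamental question with a series of chronological ordering tasks of increasing complexity over facts the model already knows from pre-training. We evaluate GPT-4.1, Claude-3.7 Sonnet with and without Extended Thinking (ET), and GPT-5 across multiple reasoning-effort settings.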
Reference score (independently calculated attention score): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are increasingly used in finance and economics, where prompt-based attempts to guard against look-ahead bias implicitly assume that models understand chronology. We test this fundamental question with a series of chronological ordering tasks of increasing complexity over facts the model already knows from pre-training. Our tasks cover (1) chronological ordering, (2) conditional sorting (filter, then order), and (3) anachronism detection. We evaluate GPT-4.1, Claude-3.7 Sonnet with and without Extended Thinking (ET), and GPT-5 across multiple reasoning-effort settings. Across models, the exact-match rate drops sharply as sequences lengthen even while rank correlations stay high: LLMs largely preserve local order but struggle to maintain a single globally consistent timeline. In conditional sorting, most failures stem from the filtering step rather than the ordering step, but GPT-5 and Claude-3.7 Sonnet with Extended Thinking significantly outperform the standard models. Lastly, anachronism detection is the easiest task for the LLMs, but performance still declines as timelines or entities increasingly overlap. Overall, our main contribution is showing that allocating an explicit reasoning budget helps with chronological ordering: GPT-5 at medium/high reasoning effort achieves flawless ordering at all lengths and perfect conditional sorting (both self-filtered and given-subset), whereas low/minimal effort degrades with longer lists, mirroring earlier models. Our findings delineate the limits of current LLMs on chronological tasks, provide insights into task complexity, and demonstrate scenarios in which reasoning helps. These patterns are important for the real-time application of LLMs in finance. We release all code and evaluation templates to support full reproducibility.
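Abstract (reference translation): Large language models (LLMs) are increasingly used in finance and economics, and prompt-based attempts to counter look-ahead bias implicitly assume that the models understand chronology. We examine this basic question with a series of chronological ordering tasks of growing complexity, built on facts the model already knows from pre-training. The tasks cover (1) chronological ordering, (2) conditional sorting (filter, then order), and (3) anachronism detection. We evaluate GPT-4.1, Claude-3.7 Sonnet with and without Extended Thinking (ET), and GPT-5 across multiple reasoning-effort settings. Across models, the exact-match rate falls sharply as sequences grow longer, while rank correlations remain high, because LLMs largely preserve local order but struggle to maintain a single globally consistent timeline. In conditional sorting, most failures come from the filtering step rather than the ordering step, but GPT-5 and Claude-3.7 Sonnet with Extended Thinking clearly outperform the standard models. Finally, anachronism detection turns out to be the easiest task for the LLMs, but performance still declines as timelines or entities overlap more. Overall, we show that allocating an explicit reasoning budget helps with chronological ordering: GPT-5 at medium/high reasoning effort achieves flawless ordering at all lengths and perfect conditional sorting (both self-filtered and given-subset), whereas low/minimal effort degrades on longer lists, mirroring earlier models. This work clarifies the limits of current LLMs on chronological tasks, offers insight into task complexity, and demonstrates scenarios in which reasoning helps. These patterns matter for the real-time application of LLMs in finance. All code and evaluation templates are released to support full reproducibility.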
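
The abstract's contrast between exact-match rate and rank correlation can be illustrated with a small sketch. This is not the authors' evaluation code. It assumes Spearman's rho as the rank correlation (the abstract only says "rank correlations"), and the function names and the ten-item example are hypothetical.

```python
def exact_match(predicted, truth):
    """Return True only if the predicted order equals the true order exactly."""
    return list(predicted) == list(truth)


def spearman_rho(predicted, truth):
    """Spearman rank correlation between two orderings of the same distinct items.

    Assumption: no ties, so the closed form 1 - 6*sum(d^2) / (n*(n^2 - 1)) applies.
    """
    n = len(truth)
    same_items = len(predicted) == n and set(predicted) == set(truth)
    if n < 2 or len(set(truth)) != n or not same_items:
        raise ValueError("need two orderings of the same distinct items, length >= 2")
    true_rank = {item: i for i, item in enumerate(truth)}
    # Squared difference between predicted position and true position, per item.
    d2 = sum((i - true_rank[item]) ** 2 for i, item in enumerate(predicted))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical example: ten events labeled by their true chronological position.
truth = list(range(10))
# One adjacent pair (4 and 5) is swapped; every other position is correct.
predicted = [0, 1, 2, 3, 5, 4, 6, 7, 8, 9]

em = exact_match(predicted, truth)
rho = spearman_rho(predicted, truth)
print(em, round(rho, 4))  # prints: False 0.9879
```

A single local swap makes the exact match fail while the rank correlation stays near 1, which is consistent with the pattern the abstract describes for longer sequences.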

Related papers