Fugu-MT Paper Translation (Abstract): Large Language Models Lack Temporal Awareness of Medical Knowledge

Paper abstract: Large Language Models Lack Temporal Awareness of Medical Knowledge

arxiv url: http://arxiv.org/abs/2605.13045v1
Date: Wed, 13 May 2026 06:04:40 GMT
Status: Translation complete
Last updated in system: 2026-05-14 23:30:27.837295
Title: Large Language Models Lack Temporal Awareness of Medical Knowledge
Title (reference translation): Large Language Models Lacking Temporal Awareness of Medical Knowledge
Authors: Zihan Guan, Qiao Jin, Guangzhi Xiong, Fangyuan Chen, Mengxuan Hu, Qingyu Chen, Yifan Peng, Zhiyong Lu, Anil Vullikanti
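Abstract summary: Existing methods for evaluating the medical knowledge of Large Language Models (LLMs) are largely based on atemporal examination-style benchmarks. The paper builds TempoMed-Bench, a first-of-its-kind benchmark for evaluating the temporal awareness of LLMs in the medical domain through evolving guideline knowledge.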
Reference score (independently computed attention metric): 30.240452466538073
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The existing methods for evaluating the medical knowledge of Large Language Models (LLMs) are largely based on atemporal examination-style benchmarks, while in reality, medical knowledge is inherently dynamic and continuously evolves as new evidence emerges and treatments are approved. Consequently, evaluating medical knowledge without a temporal context may provide an incomplete assessment of whether LLMs can accurately reason about time-specific medical knowledge. Moreover, most medical data are historical, requiring the models not only to recall the correct knowledge, but also to know when that knowledge is correct. To bridge the gap, we built TempoMed-Bench, the first-of-its-kind benchmark for evaluating the temporal awareness of the LLMs in the medical domain through evolving guideline knowledge. Based on the TempoMed-Bench, our evaluation analysis first reveals that LLMs lack temporal awareness in medical knowledge through the key findings: (1) model performance on up-to-date medical knowledge exhibits a gradual linear decline over time rather than a sharp knowledge-cutoff behavior, suggesting that parametric medical knowledge is not strictly bounded by knowledge cutoffs; (2) LLMs consistently struggle more with recalling outdated historical medical knowledge than with up-to-date recommendations: accuracy of historical knowledge is only 25.37%-53.89% of up-to-date knowledge, indicating potential knowledge forgetting effects during training; and (3) LLMs often exhibit temporally inconsistent behaviors, where predictions fluctuate irregularly across neighboring years. We also show that the temporal awareness problem is a challenge that cannot be easily solved when integrated with agentic search tools (-3.15%-14.14%). This work highlights an important yet underexplored challenge and motivates future research on developing LLMs that can better encode time-specific medical knowledge.
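Abstract (reference translation): Existing methods for evaluating the medical knowledge of Large Language Models (LLMs) are largely based on atemporal examination-style benchmarks, whereas in reality medical knowledge is inherently dynamic and evolves continuously as new evidence emerges and treatments are approved. Evaluating medical knowledge without temporal context may therefore give an incomplete assessment of whether LLMs can reason accurately about time-specific medical knowledge. Moreover, most medical data are historical, so models must not only recall the correct knowledge but also know when that knowledge is correct. To bridge this gap, we built TempoMed-Bench, a first-of-its-kind benchmark for evaluating the temporal awareness of LLMs in the medical domain through evolving guideline knowledge. Our evaluation on TempoMed-Bench shows that LLMs lack temporal awareness of medical knowledge, based on three key findings: (1) model performance on up-to-date medical knowledge declines gradually and linearly over time rather than dropping sharply at the knowledge cutoff, suggesting that parametric medical knowledge is not strictly bounded by knowledge cutoffs; (2) LLMs consistently struggle more to recall outdated historical medical knowledge than up-to-date recommendations, with accuracy on historical knowledge only 25.37% to 53.89% of that on up-to-date knowledge, indicating potential knowledge-forgetting effects during training; and (3) LLMs often behave in temporally inconsistent ways, with predictions fluctuating irregularly across neighboring years. We also show that the temporal awareness problem is not easily solved by integrating agentic search tools (-3.15% to 14.14%). This work highlights an important yet underexplored challenge and motivates future research on developing LLMs that better encode time-specific medical knowledge.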

