Fugu-MT Paper Translation (Abstract): Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination

Paper overview: Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination

arxiv url: http://arxiv.org/abs/2604.24690v1
Date: Mon, 27 Apr 2026 16:50:40 GMT
Status: Translation complete
System update date: 2026-04-28 17:12:08.183496
Title: Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination
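Title (reference translation): Can LLMs Function as Historians? Evaluating the Historical Research Capabilities of LLMs via the Chinese Imperial Examination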
Authors: Lirong Gao, Zeqing Wang, Yuyan Cai, Jiayi Deng, Yanmei Gu, Yiming Zhang, Jia Zhou, Yanfei Zhang, Junbo Zhao
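Abstract summary: ProHist-Bench is a novel benchmark anchored in the Chinese Imperial Examination (Keju) system. It features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Even state-of-the-art LLMs struggle with complex historical research questions.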
Reference score (independently computed attention metric): 11.650720838376634
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning, that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench.
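Abstract (reference translation): Large language models (LLMs) increasingly assist with historical tasks such as text processing, but their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks mainly assess breadth of basic knowledge or lexical understanding and do not capture higher-order skills, such as evidentiary reasoning, that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench.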

Related papers