Fugu-MT Paper Translation (Abstract): GIScholarBench: Benchmarking LLM Overconfidence in GIS Research

Paper abstract: GIScholarBench: Benchmarking LLM Overconfidence in GIS Research

arxiv url: http://arxiv.org/abs/2606.08036v1
Date: Sat, 06 Jun 2026 07:56:40 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:05.682825
Title: GIScholarBench: Benchmarking LLM Overconfidence in GIS Research
Authors: Zongrong Li, Mingzheng Yang, Lei Zou, Hongxu Ma, Hao Tian, Siqi Zhou, Wenjing Gong, Kaili Zhang, Bingqian Chen, Mitch Zhang, Yifan Yang
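Abstract summary: Large language models (LLMs) are increasingly used in academic research, but scholarly tasks require high factual precision. GIScholarBench is a benchmark built from 10,865 papers published in 25 core GIScience journals between 2020 and 2025. The authors evaluate Claude Sonnet 4.5, Gemini 3, and ChatGPT 5.3 through their native web interfaces under real-world user-facing conditions.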
Reference score (independently computed attention score): 14.111940657521489
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are increasingly used in academic research workflows, but scholarly tasks require high factual precision and therefore expose a key weakness: overconfidence. Here, overconfidence is defined behaviorally as the tendency to produce confident, assertive, and well-formatted outputs even when the underlying knowledge is incomplete or unverifiable, rather than as a calibration gap between stated confidence and accuracy. To examine this issue, we introduce GIScholarBench, a benchmark built from 10,865 papers published in 25 core GIScience journals between 2020 and 2025. The benchmark covers three tasks with increasing cognitive complexity: metadata retrieval, literature linking, and research direction generation. We evaluate Claude Sonnet 4.5, Gemini 3, and ChatGPT 5.3 through their native web interfaces under real-world user-facing conditions. Results show consistent overconfidence across all tasks. In metadata retrieval, ChatGPT 5.3 achieves the highest accuracy, but all models still generate definitive titles and DOIs when predictions are wrong. In literature linking, Claude Sonnet 4.5 recovers the most references, but all models show a clear gap between top-ranked retrieval and longer citation lists, suggesting that references are extended beyond reliable retrieval capacity. In research direction generation, AI-generated directions show lower topic coverage, higher novel miss rates, and lower semantic diversity than real future-citing papers. These findings suggest that LLM overconfidence is task-invariant but takes different forms: factual overgeneration in retrieval, unreliable citation expansion in literature linking, and overconfidence in output completeness during research ideation.

Related papers