Fugu-MT paper translation (abstract): Errors in AI-Assisted Retrieval of Medical Literature: A Comparative Study

Paper overview: Errors in AI-Assisted Retrieval of Medical Literature: A Comparative Study

arxiv url: http://arxiv.org/abs/2603.22344v1
Date: Sat, 21 Mar 2026 21:39:55 GMT
Status: translation complete
Last updated in system: 2026-03-25 19:53:37.082771
Title: Errors in AI-Assisted Retrieval of Medical Literature: A Comparative Study
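Title (reference translation): Errors in AI-Assisted Medical Literature Retrieval: A Comparative Study
Authors: Jenny Gao, Yongfeng Zhang, Mary L Disis, Lanjing Zhang
Abstract summary: Literature retrieval with large language models (LLMs) may lead to erroneous references. We evaluate errors in reference retrieval by widely used free-version LLM platforms.
Reference score (independently computed attention score): 29.514173936305784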
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language model (LLM)-assisted literature retrieval may lead to erroneous references, but these errors have not been rigorously quantified. Therefore, we quantitatively assess errors in reference retrieval of widely used free-version LLM platforms and identify the factors associated with retrieval errors. We evaluated 2,000 references retrieved by 5 LLMs (Grok-2, ChatGPT GPT-4.1, Google Gemini Flash 2.5, Perplexity AI, and DeepSeek GPT-4) for 40 randomly selected original articles (10 per journal) published Jan. 2024 to July 2025 from British Medical Journal (BMJ), Journal of the American Medical Association, and The New England Journal of Medicine (NEJM). Primary outcomes were a multimetric score ratio combining validity of digital object identifier, PubMed ID, Google Scholar link, and relevance; and complete miss rate (proportion of references failing all applicable metrics). Multivariable regression was used to examine independent associations. LLM platforms completely failed to retrieve correct reference data 47.8% of the time. The average score ratio of the 5 LLM platforms was 0.29 (standard deviation, 0.35; range, 0-1.25), with a higher score ratio indicating higher accuracy in retrieving relevant references and correct bibliographic data. The highest and lowest accuracies were achieved by Grok (0.57) and Gemini (0.11), respectively. Compared with BMJ, NEJM articles had lower score ratios and higher complete miss rates. Multivariable analysis shows LLM platforms and journals were independently associated with score ratios and complete miss rate, respectively. We show modest overall performance of LLMs and significant variability in retrieval accuracy across platforms and journals. LLM platforms and journals are associated with LLM performance in retrieving medical literature. Bibliographic data should be carefully reviewed when using LLM-assisted literature retrieval.
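Abstract (reference translation): Literature retrieval assisted by large language models (LLMs) may lead to erroneous references, but these errors have not been rigorously quantified. We therefore quantitatively evaluate errors in reference retrieval by widely used free-version LLM platforms and identify the factors associated with retrieval errors. We evaluated 2,000 references retrieved by 5 LLMs (Grok-2, ChatGPT GPT-4.1, Google Gemini Flash 2.5, Perplexity AI, and DeepSeek GPT-4) for 40 randomly selected original articles (10 per journal) published between January 2024 and July 2025 in the British Medical Journal (BMJ), the Journal of the American Medical Association, and The New England Journal of Medicine (NEJM). The primary outcomes were a multimetric score ratio combining the validity of the digital object identifier, PubMed ID, Google Scholar link, and relevance, and the complete miss rate (the proportion of references failing all applicable metrics). Multivariable regression was used to examine independent associations. LLM platforms completely failed to retrieve correct reference data 47.8% of the time. The average score ratio across the 5 LLM platforms was 0.29 (standard deviation, 0.35; range, 0-1.25), where a higher score ratio indicates higher accuracy in retrieving relevant references and correct bibliographic data. The highest and lowest accuracies were achieved by Grok (0.57) and Gemini (0.11), respectively. Compared with BMJ, NEJM articles had lower score ratios and higher complete miss rates. In multivariable analysis, LLM platforms and journals were independently associated with score ratios and the complete miss rate, respectively. The results show modest overall LLM performance and significant variability in retrieval accuracy across platforms and journals. LLM platforms and journals are associated with LLM performance in retrieving medical literature. Bibliographic data should be carefully reviewed when using LLM-assisted literature retrieval.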
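
The complete miss rate is defined in the abstract as the proportion of references failing all applicable metrics. A minimal Python sketch of that definition follows. The record layout, the metric key names, the use of `None` for "not applicable", and the choice to not count a reference with no applicable metrics as a miss are all assumptions made for illustration, not the authors' implementation. The score ratio is not sketched because the abstract does not state how the metrics are weighted.

```python
from typing import Optional

# Hypothetical record layout: each metric is True (valid), False (invalid),
# or None (not applicable to this reference). Key names are assumptions.
Reference = dict[str, Optional[bool]]

METRICS = ("doi", "pmid", "scholar_link", "relevant")


def is_complete_miss(ref: Reference) -> bool:
    """True when the reference fails every metric that applies to it."""
    applicable = [ref[m] for m in METRICS if ref.get(m) is not None]
    # Assumption: a reference with no applicable metrics is not a miss.
    return bool(applicable) and not any(applicable)


def complete_miss_rate(refs: list[Reference]) -> float:
    """Proportion of references that fail all applicable metrics."""
    if not refs:
        raise ValueError("no references to score")
    return sum(is_complete_miss(r) for r in refs) / len(refs)


# Made-up example data, not taken from the study.
example_refs = [
    {"doi": True, "pmid": True, "scholar_link": True, "relevant": True},
    {"doi": False, "pmid": False, "scholar_link": False, "relevant": False},
    {"doi": False, "pmid": None, "scholar_link": False, "relevant": False},
    {"doi": False, "pmid": True, "scholar_link": None, "relevant": True},
]
print(complete_miss_rate(example_refs))
```

In the made-up data, the second and third references fail every applicable metric, so two of the four count as complete misses.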

Related papers