Fugu-MT Paper Translation (Abstract): Do Large Language Models have Shared Weaknesses in Medical Question Answering?

Paper summary: Do Large Language Models have Shared Weaknesses in Medical Question Answering?

arxiv url: http://arxiv.org/abs/2310.07225v3
Date: Fri, 11 Oct 2024 14:55:44 GMT
Status: Translation complete
Last updated in system: 2024-12-05 07:28:31.54175
Title: Do Large Language Models have Shared Weaknesses in Medical Question Answering?
Title (reference translation): Do large language models share weaknesses in medical question answering?
Authors: Andrew M. Bean, Karolina Korgul, Felix Krones, Robert McCraith, Adam Mahdi
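Abstract summary: Large language models (LLMs) have improved rapidly on medical benchmarks, but their unreliability remains a persistent challenge for safe real-world use. We benchmark top LLMs and identify consistent patterns across models. We found evidence of similarities between models in which questions they answer correctly, as well as similarities with human test takers.
Reference score (independently computed attention metric): 1.25828876338076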
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have made rapid improvement on medical benchmarks, but their unreliability remains a persistent challenge for safe real-world uses. Designing for the use of LLMs as a category, rather than for specific models, requires developing an understanding of shared strengths and weaknesses which appear across models. To address this challenge, we benchmark a range of top LLMs and identify consistent patterns across models. We test $16$ well-known LLMs on $874$ newly collected questions from Polish medical licensing exams. For each question, we score each model on the top-1 accuracy and the distribution of probabilities assigned. We then compare these results with factors such as question difficulty for humans, question length, and the scores of the other models. LLM accuracies were positively correlated pairwise ($0.39$ to $0.58$). Model performance was also correlated with human performance ($0.09$ to $0.13$), but negatively correlated with the difference between the question-level accuracy of top-scoring and bottom-scoring humans ($-0.09$ to $-0.14$). The top output probability and question length were positive and negative predictors of accuracy respectively ($p < 0.05$). The top-scoring LLM, GPT-4o Turbo, scored $84\%$, with Claude Opus, Gemini 1.5 Pro and Llama 3/3.1 between $74\%$ and $79\%$. We found evidence of similarities between models in which questions they answer correctly, as well as similarities with human test takers. Larger models typically performed better, but differences in training, architecture, and data were also highly impactful. Model accuracy was positively correlated with confidence, but negatively correlated with question length. We find similar results with older models, and argue that these patterns are likely to persist across future models using similar training methods.
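Abstract (reference translation): Large language models (LLMs) are improving rapidly on medical benchmarks, but their lack of reliability remains a persistent challenge for safe real-world use. Designing for LLMs as a category, rather than for specific models, requires understanding the strengths and weaknesses that are shared across models. To address this challenge, we benchmark top-tier LLMs and identify consistent patterns across models. We tested 16 well-known LLMs on 874 newly collected questions from Polish medical licensing exams. For each question, each model is scored on top-1 accuracy and on the distribution of assigned probabilities. We then compare these results with factors such as question difficulty for humans, question length, and the scores of the other models. LLM accuracies were positively correlated pairwise (0.39 to 0.58). Model performance was also correlated with human performance (0.09 to 0.13), but negatively correlated with the difference in question-level accuracy between top-scoring and bottom-scoring humans (-0.09 to -0.14). The top output probability and question length were positive and negative predictors of accuracy, respectively (p < 0.05). The top-scoring LLM, GPT-4o Turbo, scored 84%, with Claude Opus, Gemini 1.5 Pro and Llama 3/3.1 between 74% and 79%. We found evidence of similarities between models in which questions they answer correctly, as well as similarities with human test takers. Larger models typically performed better, but differences in training, architecture, and data also had a large effect. Model accuracy was positively correlated with confidence and negatively correlated with question length. Similar results hold for older models, and we argue that these patterns are likely to persist across future models that use similar training methods.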
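
The pairwise comparison described in the abstract can be illustrated with a minimal sketch. It assumes each model's results are stored as a list of 0/1 flags (1 = correct top-1 answer) over the same questions, and it uses Pearson correlation; the abstract does not state which correlation coefficient the authors used, so that choice is an assumption. The names `pearson`, `pairwise_correlations`, and the toy data are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A model that gets every question right (or wrong) has no variance.
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(correct):
    """Correlate per-question correctness for every pair of models.

    `correct` maps a model name to a list of 0/1 flags, one per question,
    with all lists in the same question order.
    """
    return {
        (a, b): pearson(correct[a], correct[b])
        for a, b in combinations(sorted(correct), 2)
    }


# Toy, made-up correctness flags for three hypothetical models.
toy = {
    "model_a": [1, 1, 0, 1, 0, 1, 1, 0],
    "model_b": [1, 1, 0, 0, 0, 1, 1, 0],
    "model_c": [1, 0, 0, 1, 0, 1, 1, 1],
}

for pair, r in pairwise_correlations(toy).items():
    print(pair, round(r, 2))
```

A positive value for a pair means the two models tend to get the same questions right and the same questions wrong, which is the kind of shared weakness the abstract reports.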
