Fugu-MT paper translation (abstract): Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine

Paper overview: Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine

arxiv url: http://arxiv.org/abs/2311.16452v1
Date: Tue, 28 Nov 2023 03:16:12 GMT
Status: translation complete
Last updated in system: 2023-11-29 20:23:35.571250
Title: Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
Title (reference translation): Can Generalist Foundation Models Beat Special-Purpose Tuning? A Case Study in Medicine
Authors: Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Renqian Luo, Scott Mayer McKinney, Robert Osazuwa Ness, Hoifung Poon, Tao Qin, Naoto Usuyama, Chris White, Eric Horvitz
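Abstract summary: This study builds on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. It finds that prompting innovation unlocks deeper specialist capabilities and that GPT-4 easily tops prior leading results on medical benchmarks. With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite.
Reference score (attention score computed independently by the site): 89.46836590149883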
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generalist foundation models such as GPT-4 have displayed surprising capabilities in a wide variety of domains and tasks. Yet, there is a prevalent assumption that they cannot match specialist capabilities of fine-tuned models. For example, most explorations to date on medical competency benchmarks have leveraged domain-specific training, as exemplified by efforts on BioGPT and Med-PaLM. We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. Rather than using simple prompting to highlight the model's out-of-the-box capabilities, we perform a systematic exploration of prompt engineering. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks. The prompting methods we explore are general purpose, and make no specific use of domain expertise, removing the need for expert-curated content. Our experimental design carefully controls for overfitting during the prompt engineering process. We introduce Medprompt, based on a composition of several prompting strategies. With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite. The method outperforms leading specialist models such as Med-PaLM 2 by a significant margin with an order of magnitude fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27% reduction in error rate on the MedQA dataset over the best methods to date achieved with specialist models and surpasses a score of 90% for the first time. Beyond medical problems, we show the power of Medprompt to generalize to other domains and provide evidence for the broad applicability of the approach via studies of the strategy on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology.
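Abstract (reference translation): Generalist foundation models such as GPT-4 have shown surprising capabilities across a wide range of domains and tasks. However, it is commonly assumed that they cannot match the specialist capabilities of fine-tuned models. For example, most investigations of medical competency benchmarks to date have relied on domain-specific training, as exemplified by the work on BioGPT and Med-PaLM. This study builds on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. Rather than using simple prompts to highlight the model's out-of-the-box capabilities, we systematically explore prompt engineering. We find that prompting innovation unlocks deeper specialist capabilities and that GPT-4 easily surpasses prior leading results on medical benchmarks. The prompting methods we explore are general purpose and make no specific use of domain expertise, removing the need for expert-curated content. Our experimental design carefully controls for overfitting during the prompt engineering process. We introduce Medprompt, which is based on a composition of several prompting strategies. With Medprompt, GPT-4 achieves state-of-the-art results on all nine benchmark datasets in the MultiMedQA suite. The method outperforms leading specialist models such as Med-PaLM 2 by a significant margin while using an order of magnitude fewer calls to the model. Steering GPT-4 with Medprompt reduces the error rate on the MedQA dataset by 27% relative to the best methods achieved to date with specialist models, and surpasses a score of 90% for the first time. Beyond medical problems, studies of the strategy on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology show that Medprompt generalizes to other domains and that the approach is broadly applicable.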
