Fugu-MT paper translation (abstract): Benchmarking GPT-5 for biomedical natural language processing

Paper abstract: Benchmarking GPT-5 for biomedical natural language processing

arxiv url: http://arxiv.org/abs/2509.04462v2
Date: Thu, 23 Oct 2025 15:09:35 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:09.838688
Title: Benchmarking GPT-5 for biomedical natural language processing
Title (reference translation): Benchmarking GPT-5 for biomedical natural language processing
Authors: Yu Hou, Zaifu Zhan, Min Zeng, Yifan Wu, Shuang Zhou, Rui Zhang,
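Abstract summary: This study extends a unified benchmark to evaluate GPT-5 and GPT-4o on five core biomedical NLP tasks. GPT-5 consistently outperformed GPT-4o, with the largest gains on reasoning-intensive datasets.
Reference score (independently calculated attention score): 17.663813433200122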
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Biomedical literature and clinical narratives pose multifaceted challenges for natural language understanding, from precise entity extraction and document synthesis to multi-step diagnostic reasoning. This study extends a unified benchmark to evaluate GPT-5 and GPT-4o under zero-, one-, and five-shot prompting across five core biomedical NLP tasks: named entity recognition, relation extraction, multi-label document classification, summarization, and simplification, and nine expanded biomedical QA datasets covering factual knowledge, clinical reasoning, and multimodal visual understanding. Using standardized prompts, fixed decoding parameters, and consistent inference pipelines, we assessed model performance, latency, and token-normalized cost under official pricing. GPT-5 consistently outperformed GPT-4o, with the largest gains on reasoning-intensive datasets such as MedXpertQA and DiagnosisArena and stable improvements in multimodal QA. In core tasks, GPT-5 achieved better chemical NER and ChemProt scores but remained below domain-tuned baselines for disease NER and summarization. Despite producing longer outputs, GPT-5 showed comparable latency and 30 to 50 percent lower effective cost per correct prediction. Fine-grained analyses revealed improvements in diagnosis, treatment, and reasoning subtypes, whereas boundary-sensitive extraction and evidence-dense summarization remain challenging. Overall, GPT-5 approaches deployment-ready performance for biomedical QA while offering a favorable balance of accuracy, interpretability, and economic efficiency. The results support a tiered prompting strategy: direct prompting for large-scale or cost-sensitive applications, and chain-of-thought scaffolds for analytically complex or high-stakes scenarios, highlighting the continued need for hybrid solutions where precision and factual fidelity are critical.
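Abstract (reference translation): Biomedical literature and clinical narratives pose multifaceted challenges for natural language understanding, from precise entity extraction and document synthesis to multi-step diagnostic reasoning. This study extends a unified benchmark to evaluate GPT-5 and GPT-4o under zero-, one-, and five-shot prompting. The benchmark covers five core biomedical NLP tasks (named entity recognition, relation extraction, multi-label document classification, summarization, and simplification) and nine expanded biomedical QA datasets spanning factual knowledge, clinical reasoning, and multimodal visual understanding. Using standardized prompts, fixed decoding parameters, and consistent inference pipelines, the study assessed model performance, latency, and token-normalized cost under official pricing. GPT-5 consistently outperformed GPT-4o, with the largest gains on reasoning-intensive datasets such as MedXpertQA and DiagnosisArena and stable improvements in multimodal QA. In the core tasks, GPT-5 achieved better chemical NER and ChemProt scores but remained below domain-tuned baselines for disease NER and summarization. Despite producing longer outputs, GPT-5 showed comparable latency and 30 to 50 percent lower effective cost per correct prediction. Fine-grained analyses revealed improvements in diagnosis, treatment, and reasoning subtypes, whereas boundary-sensitive extraction and evidence-dense summarization remain challenging. Overall, GPT-5 approaches deployment-ready performance for biomedical QA while offering a favorable balance of accuracy, interpretability, and economic efficiency. The results support a tiered prompting strategy: direct prompting for large-scale or cost-sensitive applications, and chain-of-thought scaffolds for analytically complex or high-stakes scenarios, highlighting the continued need for hybrid solutions where precision and factual fidelity are critical.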
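Note on "effective cost per correct prediction": the abstract does not define this metric, so the following is only a minimal sketch under the assumption that it means total token-priced spend divided by the number of correct predictions. The names `RunStats`, `cost_per_correct`, and `relative_saving` are hypothetical, and the prices and counts in the usage note are placeholders, not figures from the paper or from any official price list.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Aggregate counts for one model on one dataset (hypothetical container)."""
    input_tokens: int
    output_tokens: int
    n_correct: int


def cost_per_correct(stats: RunStats,
                     usd_per_million_input: float,
                     usd_per_million_output: float) -> float:
    """Assumed definition: total token cost in USD divided by correct predictions."""
    if stats.n_correct <= 0:
        raise ValueError("n_correct must be positive")
    total_usd = (stats.input_tokens * usd_per_million_input
                 + stats.output_tokens * usd_per_million_output) / 1_000_000
    return total_usd / stats.n_correct


def relative_saving(baseline: float, candidate: float) -> float:
    """Fraction by which the candidate's cost is lower than the baseline's."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - candidate) / baseline
```

For example, with placeholder prices of 2.0 and 8.0 USD per million input and output tokens, a run with 1,000,000 input tokens, 500,000 output tokens, and 100 correct answers costs 0.06 USD per correct prediction under this assumed definition. A model can produce longer outputs and still come out cheaper on this metric if it answers enough additional items correctly.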

Related papers