Can large language models reason about medical questions?
- URL: http://arxiv.org/abs/2207.08143v4
- Date: Sun, 24 Dec 2023 11:17:23 GMT
- Title: Can large language models reason about medical questions?
- Authors: Valentin Liévin, Christoffer Egeberg Hother, Andreas Geert Motzfeldt, Ole Winther
- Abstract summary: We investigate whether closed- and open-source models can be applied to answer and reason about difficult real-world medical questions.
We focus on three popular medical benchmarks (MedQA-USMLE, MedMCQA, and PubMedQA) and multiple prompting scenarios.
Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason and recall expert knowledge.
- Score: 7.95779617839642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although large language models (LLMs) often produce impressive outputs, it
remains unclear how they perform in real-world scenarios requiring strong
reasoning skills and expert domain knowledge. We set out to investigate whether
closed- and open-source models (GPT-3.5, Llama-2, etc.) can be applied to answer
and reason about difficult real-world medical questions. We focus on three
popular medical benchmarks (MedQA-USMLE, MedMCQA, and PubMedQA) and multiple
prompting scenarios: Chain-of-Thought (CoT, think step-by-step), few-shot and
retrieval augmentation. Based on an expert annotation of the generated CoTs, we
found that InstructGPT can often read, reason and recall expert knowledge.
Finally, by leveraging advances in prompt engineering (few-shot and ensemble
methods), we demonstrated that GPT-3.5 not only yields calibrated predictive
distributions, but also reaches the passing score on three datasets:
MedQA-USMLE 60.2%, MedMCQA 62.7% and PubMedQA 78.2%. Open-source models are
closing the gap: Llama-2 70B also passed the MedQA-USMLE with 62.5% accuracy.
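
To make the prompting setup concrete, below is a minimal sketch of zero-shot CoT combined with a self-consistency ensemble (majority vote over sampled reasoning chains) for a multiple-choice question. The `complete` function, prompt wording, and answer-extraction step are illustrative assumptions, not the authors' exact protocol.

```python
import re
from collections import Counter

def complete(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical stand-in for a sampled LLM completion call."""
    raise NotImplementedError

def cot_ensemble_answer(question: str, options: dict[str, str], n_samples: int = 5) -> str:
    """Zero-shot CoT with self-consistency: sample several reasoning chains,
    extract an option letter from each, and majority-vote the result."""
    formatted = "\n".join(f"{k}) {v}" for k, v in options.items())
    prompt = (
        f"Question: {question}\n{formatted}\n"
        "Answer: Let's think step by step.\n"  # the zero-shot CoT cue
    )
    votes = []
    for _ in range(n_samples):
        chain = complete(prompt)
        # Ask the model to commit to a single option letter.
        verdict = complete(prompt + chain + "\nTherefore, among A through D, the answer is (",
                           temperature=0.0)
        match = re.search(r"[A-D]", verdict)
        if match:
            votes.append(match.group())
    if not votes:
        raise ValueError("no parsable answers sampled")
    return Counter(votes).most_common(1)[0][0]
```

Normalizing the vote histogram over the option letters is one simple way to obtain the kind of predictive distribution whose calibration the abstract reports.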
Related papers
- A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? [33.70022886795487]
OpenAI's o1 stands out as the first model trained with reinforcement learning strategies to employ a chain-of-thought technique.
This report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality.
arXiv Detail & Related papers (2024-09-23T17:59:43Z)
- Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
The accompanying instruction-tuning dataset, MedS-Ins, comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
- Uncertainty Estimation of Large Language Models in Medical Question Answering [60.72223137560633]
Large Language Models (LLMs) show promise for natural language generation in healthcare, but risk hallucinating factually incorrect information.
We benchmark popular uncertainty estimation (UE) methods with different model sizes on medical question-answering datasets.
Our results show that current approaches generally perform poorly in this domain, highlighting the challenge of UE for medical applications.
arXiv Detail & Related papers (2024-07-11T16:51:33Z)
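
As a concrete illustration of one common UE baseline of the kind such benchmarks cover, the sketch below scores uncertainty as the entropy of the empirical answer distribution under repeated sampling. The `sample_answer` function is a hypothetical stand-in for a temperature-sampled model call; this is a generic baseline, not the paper's specific method.

```python
import math
from collections import Counter

def sample_answer(question: str) -> str:
    """Hypothetical stand-in for sampling one answer from an LLM (temperature > 0)."""
    raise NotImplementedError

def predictive_entropy(question: str, n_samples: int = 20) -> float:
    """Entropy of the empirical answer distribution; higher means less certain."""
    counts = Counter(sample_answer(question) for _ in range(n_samples))
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```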
- RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models [35.60385437194243]
Current Medical Large Vision Language Models (Med-LVLMs) frequently encounter factual issues.
Retrieval-augmented generation (RAG), which utilizes external knowledge, can improve the factual accuracy of these models but introduces two major challenges.
We propose RULE, which consists of two components. First, we introduce a provably effective strategy for controlling factuality risk through the selection of retrieved contexts.
Second, based on samples where over-reliance on retrieved contexts led to errors, we curate a preference dataset to fine-tune the model.
arXiv Detail & Related papers (2024-07-06T16:45:07Z)
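
To give a flavor of the first component, the sketch below calibrates how many retrieved contexts to use by picking the largest count whose measured error rate on held-out data stays within a risk budget. This is a deliberately simplified stand-in, not RULE's provably effective strategy; all names are illustrative.

```python
def select_num_contexts(calibration_error: dict[int, float], alpha: float) -> int:
    """Largest number of retrieved contexts k whose factual-error rate,
    measured on a calibration set, stays at or below the risk budget alpha."""
    admissible = [k for k, err in calibration_error.items() if err <= alpha]
    return max(admissible, default=0)

# E.g., with error rates {1: 0.12, 2: 0.09, 4: 0.11, 8: 0.15} and alpha = 0.10,
# the selector settles on k = 2 retrieved contexts.
assert select_num_contexts({1: 0.12, 2: 0.09, 4: 0.11, 8: 0.15}, 0.10) == 2
```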
- MedConceptsQA: Open Source Medical Concepts QA Benchmark [0.07083082555458872]
We present MedConceptsQA, a dedicated open source benchmark for medical concepts question answering.
The benchmark comprises questions about various medical concepts across different vocabularies: diagnoses, procedures, and drugs.
We evaluated various Large Language Models on the benchmark.
arXiv Detail & Related papers (2024-05-12T17:54:50Z)
- Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks [17.40940406100025]
We introduce Meerkat, a new family of medical AI systems ranging from 7 to 70 billion parameters.
Our systems achieved remarkable accuracy across six medical benchmarks.
Meerkat-70B correctly diagnosed 21 out of 38 complex clinical cases, outperforming the human average of 13.8.
arXiv Detail & Related papers (2024-03-30T14:09:00Z)
- BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text [82.7001841679981]
BioMedLM is a 2.7 billion parameter GPT-style autoregressive model trained exclusively on PubMed abstracts and full articles.
When fine-tuned, BioMedLM can produce strong multiple-choice biomedical question-answering results competitive with larger models.
BioMedLM can also be fine-tuned to produce useful answers to patient questions on medical topics.
arXiv Detail & Related papers (2024-03-27T10:18:21Z)
- Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models [73.79091519226026]
Uncertainty of Thoughts (UoT) is an algorithm to augment large language models with the ability to actively seek information by asking effective questions.
In experiments on medical diagnosis, troubleshooting, and the 20 Questions game, UoT achieves an average performance improvement of 38.1% in the rate of successful task completion.
arXiv Detail & Related papers (2024-02-05T18:28:44Z)
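
The core of an uncertainty-aware planner of this kind is scoring candidate questions by how much they are expected to reduce uncertainty. Below is a minimal sketch under simplifying assumptions (a uniform belief over candidate hypotheses and binary yes/no questions); the function names are illustrative, not the paper's API.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no outcome with P(yes) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_information_gain(n_hypotheses: int, n_yes: int) -> float:
    """Expected bits gained from a yes/no question that splits a uniform belief
    over n_hypotheses into n_yes 'yes' cases and the rest 'no'. For a uniform
    prior this reduces to the binary entropy of P(yes)."""
    return binary_entropy(n_yes / n_hypotheses)

# A question splitting 8 candidate diagnoses 4/4 yields exactly 1 bit,
# so a greedy planner prefers questions that split the hypotheses evenly.
assert abs(expected_information_gain(8, 4) - 1.0) < 1e-9
```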
- MEDITRON-70B: Scaling Medical Pretraining for Large Language Models [91.25119823784705]
Large language models (LLMs) can potentially democratize access to medical knowledge.
We release MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain.
arXiv Detail & Related papers (2023-11-27T18:49:43Z)
- Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model [41.11769935795965]
We present a multi-stage training method combining Domain-specific Continued Pre-training (DCPT), Supervised Fine-tuning (SFT), and Direct Preference Optimization (DPO).
After the DCPT and SFT phases, Qilin-Med achieved 38.4% and 40.0% accuracy on the CMExam test set, respectively.
In the DPO phase, it scored 16.66 in BLEU-1 and 27.44 in ROUGE-1 on the Huatuo-26M test set, improving over the SFT phase (12.69 in BLEU-1 and 24.21 in ROUGE-1).
arXiv Detail & Related papers (2023-10-13T13:17:03Z)
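
For readers unfamiliar with the metrics in the entry above: BLEU-1 is a clipped unigram-precision score and ROUGE-1 (recall) measures how many reference unigrams the generation covers. A minimal sketch follows, assuming whitespace tokenization, a single reference, and no brevity penalty, so its outputs are only indicative of the reported numbers.

```python
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision (BLEU-1 without the brevity penalty)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(n, ref[tok]) for tok, n in cand.items())
    return overlap / max(sum(cand.values()), 1)

def rouge1_recall(candidate: str, reference: str) -> float:
    """Unigram recall: fraction of reference tokens covered by the candidate."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(n, cand[tok]) for tok, n in ref.items())
    return overlap / max(sum(ref.values()), 1)

# Scores such as 16.66 and 27.44 are these fractions scaled by 100.
print(bleu1("aspirin reduces fever", "aspirin reduces fever and pain") * 100)          # 100.0
print(rouge1_recall("aspirin reduces fever", "aspirin reduces fever and pain") * 100)  # 60.0
```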
- PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding.
LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge.
We describe the procedure for building a powerful, open-source language model specifically designed for medical applications, termed PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z)