Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information
- URL: http://arxiv.org/abs/2505.06046v2
- Date: Thu, 15 May 2025 15:14:47 GMT
- Title: Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information
- Authors: Joshua Harris, Fan Grayson, Felix Feldman, Timothy Laurence, Toby Nonnenmacher, Oliver Higgins, Leo Loman, Selina Patel, Thomas Finnie, Samuel Collins, Michael Borowitz, et al.
- Abstract summary: This paper introduces a new benchmark, PubHealthBench, with over 8000 questions for evaluating Large Language Models' (LLMs) Multiple Choice Question Answering (MCQA) and free-form responses to public health queries. We extract free text from 687 current UK government guidance documents and implement an automated pipeline for generating MCQA samples. Assessing 24 LLMs on PubHealthBench, we find the latest private LLMs have a high degree of knowledge, achieving >90% accuracy in the MCQA setup and outperforming humans with cursory search engine use.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Large Language Models (LLMs) become widely accessible, a detailed understanding of their knowledge within specific domains becomes necessary for successful real-world use. This is particularly critical in public health, where failure to retrieve relevant, accurate, and current information could significantly impact UK residents. However, little is currently known about LLM knowledge of UK Government public health information. To address this issue, this paper introduces a new benchmark, PubHealthBench, with over 8000 questions for evaluating LLMs' Multiple Choice Question Answering (MCQA) and free-form responses to public health queries. To create PubHealthBench we extract free text from 687 current UK government guidance documents and implement an automated pipeline for generating MCQA samples. Assessing 24 LLMs on PubHealthBench, we find the latest private LLMs (GPT-4.5, GPT-4.1 and o1) have a high degree of knowledge, achieving >90% accuracy in the MCQA setup, and outperform humans with cursory search engine use. However, in the free-form setup we see lower performance, with no model scoring >75%. Importantly, in both setups we find LLMs have higher accuracy on guidance intended for the general public. Therefore, there are promising signs that state-of-the-art (SOTA) LLMs are an increasingly accurate source of public health information, but additional safeguards or tools may still be needed when providing free-form responses on public health topics.
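To make the MCQA evaluation setup concrete, below is a minimal sketch of how such a benchmark might be scored. Everything here (the MCQASample fields, format_prompt, ask_model, and the example question) is an illustrative assumption, not the authors' actual pipeline, schema, or data.

```python
# Minimal sketch of an MCQA evaluation harness in the spirit of
# PubHealthBench. All names and fields are illustrative assumptions,
# not the paper's actual code.
from dataclasses import dataclass
from typing import Callable


@dataclass
class MCQASample:
    question: str             # generated from a guidance-document passage
    options: dict[str, str]   # option letter -> option text
    answer: str               # gold option letter, e.g. "A"
    source_doc: str           # which guidance document the item came from
    audience: str             # e.g. "general public" vs. "professional"


def format_prompt(sample: MCQASample) -> str:
    """Render a sample as a standard multiple-choice prompt."""
    lines = [sample.question]
    lines += [f"{letter}. {text}" for letter, text in sorted(sample.options.items())]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def accuracy(samples: list[MCQASample], ask_model: Callable[[str], str]) -> float:
    """Fraction of samples where the model's reply starts with the gold letter."""
    correct = sum(
        ask_model(format_prompt(s)).strip().upper().startswith(s.answer)
        for s in samples
    )
    return correct / len(samples)


if __name__ == "__main__":
    # Hypothetical item; a real benchmark item would be generated
    # automatically from UK government guidance text.
    demo = [
        MCQASample(
            question="For how long should hands be washed with soap and water?",
            options={"A": "At least 20 seconds", "B": "About 5 seconds",
                     "C": "At least 2 minutes", "D": "Only until visibly clean"},
            answer="A",
            source_doc="hand-hygiene-guidance",
            audience="general public",
        )
    ]
    # Stub model that always answers "A"; swap in a real LLM call here.
    print(accuracy(demo, ask_model=lambda prompt: "A"))  # -> 1.0
```

Matching only the leading letter of the reply is one common answer-extraction convention for MCQA harnesses; the paper's free-form setup would instead require grading full responses against the source guidance.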
Related papers
- Dr. GPT Will See You Now, but Should It? Exploring the Benefits and Harms of Large Language Models in Medical Diagnosis using Crowdsourced Clinical Cases
Large Language Models (LLMs) are used in high-stakes applications such as medical self-diagnosis and preliminary triage. This paper presents the findings from a university-level competition that leveraged a novel, crowdsourced approach for evaluating the effectiveness of LLMs.
arXiv Detail & Related papers (2025-06-13T17:12:47Z) - MIRIAD: Augmenting LLMs with millions of medical query-response pairs
We introduce MIRIAD, a large-scale, curated corpus of 5,821,948 medical QA pairs. We show that MIRIAD improves accuracy by up to 6.7% compared to unstructured RAG baselines. We also introduce MIRIAD-Atlas, an interactive map of MIRIAD spanning 56 medical disciplines.
arXiv Detail & Related papers (2025-06-06T13:52:32Z) - Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities. We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z) - Evaluating Large Language Models for Public Health Classification and Extraction Tasks
We present evaluations of Large Language Models (LLMs) for public health tasks involving the classification and extraction of free text. We evaluate eleven open-weight LLMs across all tasks using zero-shot in-context learning. We find promising signs that LLMs may be useful tools for public health experts to extract information from a wide variety of free text sources.
arXiv Detail & Related papers (2024-05-23T16:33:18Z) - Perception of Knowledge Boundary for Large Language Models through Semi-open-ended Question Answering
Large Language Models (LLMs) are widely used for knowledge-seeking yet suffer from hallucinations.
In this paper, we perceive the LLMs' knowledge boundary (KB) with semi-open-ended questions (SoeQ).
We find that GPT-4 performs poorly on SoeQ and is often unaware of its KB.
Our auxiliary model, LLaMA-2-13B, is effective in discovering more ambiguous answers.
arXiv Detail & Related papers (2024-05-23T10:00:14Z) - OpenMedLM: Prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models
Open-source (OS) models represent a key area of growth for medical LLMs.
We present OpenMedLM, a prompting platform which delivers state-of-the-art (SOTA) performance for OS LLMs on medical benchmarks.
arXiv Detail & Related papers (2024-02-29T17:19:39Z) - Retrieval Augmented Thought Process for Private Data Handling in Healthcare
We introduce the Retrieval-Augmented Thought Process (RATP), which formulates the thought generation of Large Language Models (LLMs) as a multi-step decision process.
On a private dataset of electronic medical records, RATP achieves 35% additional accuracy compared to in-context retrieval-augmented generation for the question-answering task.
arXiv Detail & Related papers (2024-02-12T17:17:50Z) - Large Language Models: A Survey
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks. LLMs' ability for general-purpose language understanding and generation is acquired by training billions of parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z) - Understanding the concerns and choices of public when using large language models for healthcare
Large language models (LLMs) have shown their potential in biomedical fields.
How the public uses them for healthcare purposes such as medical Q&A, self-diagnosis, and daily healthcare information seeking is under-investigated.
arXiv Detail & Related papers (2024-01-17T09:51:32Z) - A Survey of Large Language Models in Medicine: Progress, Application, and Challenge
Large language models (LLMs) have received substantial attention due to their capabilities for understanding and generating human language.
This review aims to provide a detailed overview of the development and deployment of LLMs in medicine.
arXiv Detail & Related papers (2023-11-09T02:55:58Z) - Quantifying Self-diagnostic Atomic Knowledge in Chinese Medical Foundation Model: A Computational Analysis
Foundation Models (FMs) have the potential to revolutionize the way users self-diagnose through search engines by offering direct and efficient suggestions.
Recent studies have primarily focused on the quality of FMs as evaluated by GPT-4, or on their ability to pass medical exams.
No studies have quantified the extent of self-diagnostic atomic knowledge stored in FMs' memory.
arXiv Detail & Related papers (2023-10-18T05:42:22Z) - Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT). LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules. We found that medical textbooks are a more effective retrieval corpus than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records
Large language models (LLMs) can follow natural language instructions with human-level fluency.
However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.