Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment
- URL: http://arxiv.org/abs/2502.14275v1
- Date: Thu, 20 Feb 2025 05:27:51 GMT
- Title: Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment
- Authors: Jiaxi Li, Yiwei Wang, Kai Zhang, Yujun Cai, Bryan Hooi, Nanyun Peng, Kai-Wei Chang, Jin Lu
- Abstract summary: Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored.
Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities.
We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
- Score: 108.55277188617035
- Abstract: Large language models (LLMs) have been widely adopted in various downstream task domains. However, their ability to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities. Given the high-stakes nature of medical applications, where incorrect information can have critical consequences, it is essential to evaluate how well LLMs encode, retain, and recall fundamental medical facts. To bridge this gap, we introduce the Medical Knowledge Judgment (MKJ), a dataset specifically designed to measure LLMs' one-hop factual medical knowledge. MKJ is constructed from the Unified Medical Language System (UMLS), a large-scale repository of standardized biomedical vocabularies and knowledge graphs. We frame knowledge assessment as a binary judgment task, requiring LLMs to verify the correctness of medical statements extracted from reliable and structured knowledge sources. Our experiments reveal that LLMs struggle with factual medical knowledge retention, exhibiting significant performance variance across different semantic categories, particularly for rare medical conditions. Furthermore, LLMs show poor calibration, often being overconfident in incorrect answers. To mitigate these issues, we explore retrieval-augmented generation, demonstrating its effectiveness in improving factual accuracy and reducing uncertainty in medical decision-making.
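The abstract's evaluation setup — verbalizing one-hop knowledge-graph triples as true/false statements, pairing each with a corrupted negative, and scoring both accuracy and overconfidence — can be sketched as follows. This is a minimal illustration of that general recipe, not the paper's actual pipeline: the toy triples, the verbalization templates, and the `judge` interface are all hypothetical stand-ins.

```python
# Hypothetical sketch of an MKJ-style binary judgment evaluation:
# (subject, relation, object) facts become true/false statements,
# and a model's verdicts are scored for accuracy and overconfidence.
import random

# Toy stand-ins for UMLS-style one-hop triples (illustrative, not real UMLS data).
TRIPLES = [
    ("Aspirin", "may_treat", "Headache"),
    ("Insulin", "may_treat", "Diabetes mellitus"),
    ("Amoxicillin", "has_mechanism", "Cell wall synthesis inhibition"),
]
OBJECTS = [obj for _, _, obj in TRIPLES]

def verbalize(subj, rel, obj):
    """Render a triple as a natural-language statement (templates are made up)."""
    templates = {
        "may_treat": f"{subj} may be used to treat {obj}.",
        "has_mechanism": f"{subj} works by {obj.lower()}.",
    }
    return templates[rel]

def make_items(triples, seed=0):
    """Pair each true statement with a false one built by swapping the object."""
    rng = random.Random(seed)
    items = []
    for subj, rel, obj in triples:
        items.append((verbalize(subj, rel, obj), True))
        wrong = rng.choice([o for o in OBJECTS if o != obj])
        items.append((verbalize(subj, rel, wrong), False))
    return items

def evaluate(judge, items):
    """Score accuracy and a simple confidence gap.

    `judge` is assumed to return (verdict: bool, confidence: float in [0, 1]);
    a positive gap means mean confidence exceeds accuracy (overconfidence).
    """
    correct, conf_sum = 0, 0.0
    for statement, label in items:
        verdict, confidence = judge(statement)
        correct += (verdict == label)
        conf_sum += confidence
    n = len(items)
    accuracy = correct / n
    return accuracy, conf_sum / n - accuracy

# A trivially overconfident mock judge: always answers "true" with high confidence.
mock_judge = lambda statement: (True, 0.95)
acc, gap = evaluate(mock_judge, make_items(TRIPLES))
```

Because every true statement has exactly one corrupted counterpart, a judge that always answers "true" lands at 50% accuracy, so its large positive confidence gap directly exposes the miscalibration the paper reports.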
Related papers
- CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
CliMedBench is a comprehensive benchmark with 14 expert-guided core clinical scenarios.
The reliability of this benchmark has been confirmed in several ways.
arXiv Detail & Related papers (2024-10-04T15:15:36Z)
- Reliable and diverse evaluation of LLM medical knowledge mastery [6.825565574784612]
We propose a novel framework that generates reliable and diverse test samples to evaluate medical-specific LLMs.
We use our proposed framework to systematically investigate the mastery of medical factual knowledge of 12 well-known LLMs.
arXiv Detail & Related papers (2024-09-22T03:13:38Z)
- Diagnosing and Remedying Knowledge Deficiencies in LLMs via Label-free Curricular Meaningful Learning [42.38865072597821]
Large Language Models (LLMs) are versatile and demonstrate impressive generalization ability.
They still exhibit reasoning mistakes, often stemming from knowledge deficiencies.
We propose a label-free curricular meaningful learning framework (LaMer) to diagnose and remedy the knowledge deficiencies of LLMs.
arXiv Detail & Related papers (2024-08-21T08:39:49Z)
- MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering [5.065947993017158]
Large Language Models (LLMs) have demonstrated an impressive ability to encode knowledge during pre-training on large text corpora.
We examine the capability of LLMs to exhibit medical knowledge recall by constructing a novel dataset derived from systematic reviews.
arXiv Detail & Related papers (2024-06-09T16:33:28Z)
- MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge [4.8004472307210255]
Large language models (LLMs) have excelled across domains, delivering notable performance on medical evaluation benchmarks.
However, there still exists a significant gap between the reported performance and the practical effectiveness in real-world medical scenarios.
We develop a novel evaluation framework MultifacetEval to examine the degree and coverage of LLMs in encoding and mastering medical knowledge.
arXiv Detail & Related papers (2024-06-05T04:15:07Z)
- Editing Factual Knowledge and Explanatory Ability of Medical Large Language Models [89.13883089162951]
Model editing aims to precisely alter the behaviors of large language models (LLMs) in relation to specific knowledge.
This approach has proven effective in addressing issues of hallucination and outdated information in LLMs.
However, the potential of using model editing to modify knowledge in the medical field remains largely unexplored.
arXiv Detail & Related papers (2024-02-28T06:40:57Z)
- KnowTuning: Knowledge-aware Fine-tuning for Large Language Models [83.5849717262019]
We propose a knowledge-aware fine-tuning (KnowTuning) method to improve fine-grained and coarse-grained knowledge awareness of LLMs.
Under fine-grained fact evaluation, KnowTuning generates more facts with a lower factual error rate.
arXiv Detail & Related papers (2024-02-17T02:54:32Z)
- A Survey of Large Language Models in Medicine: Progress, Application, and Challenge [85.09998659355038]
Large language models (LLMs) have received substantial attention due to their capabilities for understanding and generating human language.
This review aims to provide a detailed overview of the development and deployment of LLMs in medicine.
arXiv Detail & Related papers (2023-11-09T02:55:58Z)
- Knowledge-tuning Large Language Models with Structured Medical Knowledge Bases for Reliable Response Generation in Chinese [29.389119917322102]
Large Language Models (LLMs) have demonstrated remarkable success in diverse natural language processing (NLP) tasks in general domains.
We propose knowledge-tuning, which leverages structured medical knowledge bases for the LLMs to grasp domain knowledge efficiently.
We also release cMedKnowQA, a Chinese medical knowledge question-answering dataset constructed from medical knowledge bases.
arXiv Detail & Related papers (2023-09-08T07:42:57Z)
- Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
However, they still struggle with accuracy and interpretability, especially in mission-critical domains such as healthcare.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.