Disentangling Reasoning and Knowledge in Medical Large Language Models
- URL: http://arxiv.org/abs/2505.11462v2
- Date: Tue, 24 Jun 2025 03:27:30 GMT
- Title: Disentangling Reasoning and Knowledge in Medical Large Language Models
- Authors: Rahul Thapa, Qingyang Wu, Kevin Wu, Harrison Zhang, Angela Zhang, Eric Wu, Haotian Ye, Suhana Bedi, Nevin Aresh, Joseph Boen, Shriya Reddy, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou
- Abstract summary: Medical reasoning in large language models aims to emulate clinicians' diagnostic thinking. Current benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA often mix reasoning with factual recall. We evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3). We train BioMed-R1 using fine-tuning and reinforcement learning on reasoning-heavy examples.
- Score: 23.401484250342158
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical reasoning in large language models (LLMs) aims to emulate clinicians' diagnostic thinking, but current benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA often mix reasoning with factual recall. We address this by separating 11 biomedical QA benchmarks into reasoning- and knowledge-focused subsets using a PubMedBERT classifier that reaches 81 percent accuracy, comparable to human performance. Our analysis shows that only 32.8 percent of questions require complex reasoning. We evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3), finding consistent gaps between knowledge and reasoning performance. For example, HuatuoGPT-o1 scores 56.9 on knowledge but only 44.8 on reasoning. In adversarial tests where models are misled with incorrect initial reasoning, biomedical models degrade sharply, while larger or RL-trained general models show more robustness. To address this, we train BioMed-R1 using fine-tuning and reinforcement learning on reasoning-heavy examples. It achieves the strongest performance among similarly sized models. Further gains may come from incorporating clinical case reports and training with adversarial and backtracking scenarios.
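The abstract describes a PubMedBERT classifier (81 percent accuracy) used to split benchmark questions into reasoning- and knowledge-focused subsets. The sketch below illustrates how such a classifier could be applied at inference time; the checkpoint name, label order, and the `classify_question` helper are illustrative assumptions rather than the authors' released pipeline, and the classification head would first need to be fine-tuned on questions annotated as reasoning- or knowledge-focused.

```python
# Minimal sketch (not the authors' released code): labeling a QA item as
# reasoning- vs knowledge-focused with a PubMedBERT sequence classifier.
# The checkpoint name and label order below are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
LABELS = ["knowledge", "reasoning"]  # hypothetical label order

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# In practice this classification head would be fine-tuned on annotated
# questions first; here it is freshly initialized for demonstration only.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def classify_question(question: str) -> str:
    """Return 'reasoning' or 'knowledge' for a single benchmark question."""
    inputs = tokenizer(question, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[logits.argmax(dim=-1).item()]

print(classify_question(
    "A 54-year-old man presents with crushing chest pain radiating to the left arm..."
))
```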
Related papers
- ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning [44.96018028534255]
ReasonMed is the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths. We train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.
arXiv Detail & Related papers (2025-06-11T08:36:55Z)
- MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports [49.00805568780791]
We introduce MedCaseReasoning, the first open-access dataset for evaluating Large Language Models (LLMs) on their ability to align with clinician-authored diagnostic reasoning. The dataset includes 14,489 diagnostic question-and-answer cases, each paired with detailed reasoning statements. We evaluate state-of-the-art reasoning LLMs on MedCaseReasoning and find significant shortcomings in their diagnoses and reasoning.
arXiv Detail & Related papers (2025-05-16T22:34:36Z)
- m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models [21.849783391186754]
We provide the first comprehensive investigation of test-time scaling for medical reasoning and present m1. Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning. We find that increasing data scale, improving data quality, and expanding model capacity consistently enhance medical knowledge grounding.
arXiv Detail & Related papers (2025-04-01T14:57:43Z)
- Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models [6.176432104264649]
Vision-language models (VLMs) have achieved impressive progress in natural image reasoning, yet their potential in medical imaging remains underexplored. We propose Med-R1, a reinforcement learning (RL)-enhanced vision-language model designed to improve generalization and reliability in medical reasoning. We evaluate Med-R1 across eight distinct medical imaging modalities.
arXiv Detail & Related papers (2025-03-18T06:12:38Z)
- Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress? [44.265524592991945]
We show that medical models fail to consistently improve over their base models in the zero-/few-shot prompting regime for medical question-answering tasks.
Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities.
arXiv Detail & Related papers (2024-11-06T18:51:02Z)
- Brain Tumor Classification on MRI in Light of Molecular Markers [61.77272414423481]
Co-deletion of the 1p/19q gene is associated with clinical outcomes in low-grade gliomas. This study aims to use a specially designed MRI-based convolutional neural network for brain cancer detection.
arXiv Detail & Related papers (2024-09-29T07:04:26Z)
- Pattern Recognition or Medical Knowledge? The Problem with Multiple-Choice Questions in Medicine [3.471944921180245]
Large Language Models (LLMs) demonstrate significant potential in the medical domain. They are often evaluated using multiple-choice questions (MCQs) modeled on exams like the USMLE. We created a fictional medical benchmark centered on an imaginary organ, the Glianorex, allowing us to separate memorized knowledge from reasoning ability.
arXiv Detail & Related papers (2024-06-04T15:08:56Z)
- Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
LLaVA-Rad inference is fast and can run on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z)
- BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks [68.39821375903591]
Generalist AI holds the potential to address the limitations of task-specific models, owing to its versatility in interpreting different data types.
Here, we propose BiomedGPT, the first open-source and lightweight vision-language foundation model.
arXiv Detail & Related papers (2023-05-26T17:14:43Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the experiments show that state-of-the-art neural models still perform far below the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- Probing Pre-Trained Language Models for Disease Knowledge [38.73378973397647]
We introduce DisKnE, a new benchmark for Disease Knowledge Evaluation.
We define training-test splits per disease, ensuring that no knowledge about test diseases can be learned from the training data.
When analysing pre-trained models for the clinical/biomedical domain on the proposed benchmark, we find that their performance drops considerably.
arXiv Detail & Related papers (2021-06-14T10:31:25Z)