HealthContradict: Evaluating Biomedical Knowledge Conflicts in Language Models
- URL: http://arxiv.org/abs/2512.02299v1
- Date: Tue, 02 Dec 2025 00:38:42 GMT
- Title: HealthContradict: Evaluating Biomedical Knowledge Conflicts in Language Models
- Authors: Boya Zhang, Alban Bornet, Rui Yang, Nan Liu, Douglas Teodoro,
- Abstract summary: We assess the ability of language models to reason over long, conflicting biomedical contexts using HealthContradict. Our experiments show that the strength of fine-tuned biomedical language models lies in their ability to exploit correct context while resisting incorrect context.
- Score: 9.557404300696538
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How do language models use contextual information to answer health questions? How are their responses impacted by conflicting contexts? We assess the ability of language models to reason over long, conflicting biomedical contexts using HealthContradict, an expert-verified dataset of 920 unique instances, each consisting of a health-related question, a factual answer supported by scientific evidence, and two documents presenting contradictory stances. We consider several prompt settings, including correct, incorrect, or contradictory context, and measure their impact on model outputs. Compared to existing medical question-answering benchmarks, HealthContradict more sharply differentiates language models' contextual reasoning capabilities. Our experiments show that the strength of fine-tuned biomedical language models lies not only in their parametric knowledge from pretraining, but also in their ability to exploit correct context while resisting incorrect context.
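The evaluation protocol described in the abstract (one question, a gold answer, and correct/incorrect/contradictory context settings) can be sketched as follows. This is a minimal illustration, not the authors' code: the `Instance` fields, the prompt template, and the containment-based scoring are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    question: str        # health-related question
    answer: str          # expert-verified factual answer
    correct_doc: str     # document supporting the answer
    incorrect_doc: str   # document contradicting the answer

def build_prompt(inst: Instance, setting: str) -> str:
    """Assemble a QA prompt under one of the context settings
    described in the abstract (hypothetical template)."""
    if setting == "correct":
        context = inst.correct_doc
    elif setting == "incorrect":
        context = inst.incorrect_doc
    elif setting == "contradictory":
        # both stances shown at once
        context = inst.correct_doc + "\n\n" + inst.incorrect_doc
    elif setting == "none":
        # parametric-knowledge-only baseline
        context = ""
    else:
        raise ValueError(f"unknown setting: {setting}")
    header = f"Context:\n{context}\n\n" if context else ""
    return f"{header}Question: {inst.question}\nAnswer:"

def accuracy(model, instances, setting: str) -> float:
    """Fraction of instances whose model output contains the gold
    answer (simple containment match, for illustration only)."""
    hits = sum(
        inst.answer.lower() in model(build_prompt(inst, setting)).lower()
        for inst in instances
    )
    return hits / len(instances)
```

Comparing `accuracy` across the four settings is one way to separate a model's parametric knowledge (the `"none"` setting) from its ability to exploit correct context and resist incorrect or contradictory context.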
Related papers
- MedScore: Generalizable Factuality Evaluation of Free-Form Medical Answers by Domain-adapted Claim Decomposition and Verification [51.82420076479152]
We propose MedScore, a new pipeline to decompose medical answers into condition-aware valid facts and verify them against in-domain corpora. Our method extracts up to three times more valid facts than existing methods, reducing hallucination and vague references, and retaining condition-dependency in facts.
arXiv Detail & Related papers (2025-05-24T01:23:09Z) - Do LLMs Provide Consistent Answers to Health-Related Questions across Languages? [14.87110905165928]
We examine the consistency of responses provided by Large Language Models (LLMs) to health-related questions across English, German, Turkish, and Chinese. We reveal significant inconsistencies in responses that could spread healthcare misinformation. Our findings emphasize the need for improved cross-lingual alignment to ensure accurate and equitable healthcare information.
arXiv Detail & Related papers (2025-01-24T18:51:26Z) - Uncertainty Estimation of Large Language Models in Medical Question Answering [60.72223137560633]
Large Language Models (LLMs) show promise for natural language generation in healthcare, but risk hallucinating factually incorrect information.
We benchmark popular uncertainty estimation (UE) methods with different model sizes on medical question-answering datasets.
Our results show that current approaches generally perform poorly in this domain, highlighting the challenge of UE for medical applications.
arXiv Detail & Related papers (2024-07-11T16:51:33Z) - Context versus Prior Knowledge in Language Models [49.17879668110546]
Language models often need to integrate prior knowledge learned during pretraining and new information presented in context.
We propose two mutual information-based metrics to measure a model's dependency on a context and on its prior about an entity.
arXiv Detail & Related papers (2024-04-06T13:46:53Z) - Evaluating Biases in Context-Dependent Health Questions [16.818168401472075]
We study how large language model biases are exhibited through contextual questions in the healthcare domain.
Our experiments reveal biases in each of these attributes, where young adult female users are favored.
arXiv Detail & Related papers (2024-03-07T19:15:40Z) - Explanatory Argument Extraction of Correct Answers in Resident Medical Exams [5.399800035598185]
We present a new dataset which includes not only explanatory arguments for the correct answer, but also arguments to reason why the incorrect answers are not correct.
This new benchmark allows us to setup a novel extractive task which consists of identifying the explanation of the correct answer written by medical doctors.
arXiv Detail & Related papers (2023-12-01T13:22:35Z) - FaMeSumm: Investigating and Improving Faithfulness of Medical Summarization [20.7585913214759]
Current summarization models often produce unfaithful outputs for medical input text.
FaMeSumm is a framework to improve faithfulness by fine-tuning pre-trained language models based on medical knowledge.
arXiv Detail & Related papers (2023-11-03T23:25:53Z) - Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models [49.95603725998561]
We propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts.
Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model.
arXiv Detail & Related papers (2023-10-04T21:57:09Z) - Exploring the In-context Learning Ability of Large Language Model for Biomedical Concept Linking [4.8882241537236455]
This research investigates a method that exploits the in-context learning capabilities of large models for biomedical concept linking.
The proposed approach adopts a two-stage retrieve-and-rank framework.
It achieved an accuracy of 90% in BC5CDR disease entity normalization and 94.7% in chemical entity normalization.
arXiv Detail & Related papers (2023-07-03T16:19:50Z) - CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; experimental results show that state-of-the-art neural models perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z) - Probing Pre-Trained Language Models for Disease Knowledge [38.73378973397647]
We introduce DisKnE, a new benchmark for Disease Knowledge Evaluation.
We define training-test splits per disease, ensuring that no knowledge about test diseases can be learned from the training data.
When analysing pre-trained models for the clinical/biomedical domain on the proposed benchmark, we find that their performance drops considerably.
arXiv Detail & Related papers (2021-06-14T10:31:25Z) - Assessing the Severity of Health States based on Social Media Posts [62.52087340582502]
We propose a multiview learning framework that models both the textual content as well as contextual-information to assess the severity of the user's health state.
The diverse NLU views demonstrate the framework's effectiveness both on the tasks and on individual diseases when assessing a user's health.
arXiv Detail & Related papers (2020-09-21T03:45:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.