Factuality Beyond Coherence: Evaluating LLM Watermarking Methods for Medical Texts
- URL: http://arxiv.org/abs/2509.07755v2
- Date: Sat, 20 Sep 2025 03:20:18 GMT
- Title: Factuality Beyond Coherence: Evaluating LLM Watermarking Methods for Medical Texts
- Authors: Rochana Prih Hastuti, Rian Adam Rajagede, Mansour Al Ghanim, Mengxin Zheng, Qian Lou
- Abstract summary: We introduce the Factuality-Weighted Score (FWS), a metric prioritizing factual accuracy beyond coherence to guide watermarking deployment in medical domains. Our evaluation shows current watermarking methods substantially compromise medical factuality, with entropy shifts degrading medical entity representation. These findings underscore the need for domain-aware watermarking approaches to preserve the integrity of medical content.
- Score: 14.42794744856763
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) are adapted to sensitive domains such as medicine, their fluency raises safety risks, particularly regarding provenance and accountability. Watermarking embeds detectable patterns to mitigate these risks, yet its reliability in medical contexts remains untested. Existing benchmarks focus on detection-quality tradeoffs and overlook factual risks. In medical text, watermarking often reweights low-entropy tokens, which are highly predictable and often carry critical medical terminology. Shifting these tokens can cause inaccuracy and hallucinations, risks that prior general-domain benchmarks fail to capture. We propose a medical-focused evaluation workflow that jointly assesses factual accuracy and coherence. Using GPT-Judger and further human validation, we introduce the Factuality-Weighted Score (FWS), a composite metric prioritizing factual accuracy beyond coherence to guide watermarking deployment in medical domains. Our evaluation shows current watermarking methods substantially compromise medical factuality, with entropy shifts degrading medical entity representation. These findings underscore the need for domain-aware watermarking approaches that preserve the integrity of medical content.
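The mechanism behind the factual risk is easy to make concrete. Below is a minimal sketch, assuming a Kirchenbauer-style "green-list" soft watermark (one common scheme matching the abstract's description of reweighting low-entropy tokens) and a hypothetical convex-combination form for the Factuality-Weighted Score; the paper's exact FWS formula is not given here, so the weight w_f, the token names, and all numbers below are illustrative assumptions, not the authors' implementation.

```python
import math

# Hedged sketch, not the paper's implementation: a green-list soft
# watermark adds a bias delta to a pseudorandom subset of token logits.
# On low-entropy steps (e.g., a near-certain clinical term), that bias
# can flip the argmax, which is the factual risk the abstract describes.

def greenlist_bias(logits, green_ids, delta=2.0):
    """Add watermark bias delta to green-list token logits."""
    return {tok: lp + delta if tok in green_ids else lp
            for tok, lp in logits.items()}

def softmax(logits):
    m = max(logits.values())
    exps = {tok: math.exp(lp - m) for tok, lp in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

# Low-entropy step: the model strongly prefers the correct drug name.
logits = {"metformin": 9.0, "insulin": 6.5, "aspirin": 4.0}
biased = greenlist_bias(logits, green_ids={"insulin"}, delta=3.0)

before = softmax(logits)
after = softmax(biased)
print(max(before, key=before.get))  # metformin (correct term survives)
print(max(after, key=after.get))    # insulin (watermark flipped the term)

# Hypothetical Factuality-Weighted Score: a convex combination that
# up-weights factuality over coherence. The form and w_f = 0.7 are
# assumptions; the paper only states that FWS prioritizes factual
# accuracy beyond coherence.
def fws(factuality, coherence, w_f=0.7):
    return w_f * factuality + (1.0 - w_f) * coherence

print(fws(0.4, 0.9))  # 0.55: low factuality dominates despite high coherence
```

Because the distribution at that step is low-entropy, even a modest logit bias flips the sampled term, and a factuality-dominant weight (here w_f = 0.7) penalizes such a flip more heavily than an unweighted average of factuality and coherence would.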
Related papers
- X-Mark: Saliency-Guided Robust Dataset Ownership Verification for Medical Imaging [67.85884025186755]
High-quality medical imaging datasets are essential for training deep learning models, but their unauthorized use raises serious copyright and ethical concerns. Medical imaging presents a unique challenge for existing dataset ownership verification methods designed for natural images. We propose X-Mark, a sample-specific clean-label watermarking method for chest x-ray copyright protection.
arXiv Detail & Related papers (2026-02-10T00:03:43Z) - MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs [7.2159153945746795]
Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. We introduce MediEval, a benchmark that links MIMIC-IV electronic health records to a unified knowledge base built from UMLS and other biomedical vocabularies. MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework.
arXiv Detail & Related papers (2025-12-23T22:52:24Z) - MedRule-KG: A Knowledge-Graph--Steered Scaffold for Reliable Mathematical and Biomedical Reasoning [0.0]
We present MedRule-KG, a compact knowledge-graph scaffold paired with a lightweight verifier that steers generation toward mathematically and biomedically valid outputs. Across 90 tasks spanning reaction feasibility, metabolic compatibility, and toxicity screening, MedRule-KG reduces violation counts by 83.2% relative to a strong chain-of-thought baseline.
arXiv Detail & Related papers (2025-11-17T04:42:52Z) - EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models [82.43729208063468]
Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy -- models' tendency to uncritically echo user-provided information. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs.
arXiv Detail & Related papers (2025-09-24T14:09:55Z) - SURE-Med: Systematic Uncertainty Reduction for Enhanced Reliability in Medical Report Generation [2.2185034594788164]
We propose SURE-Med, a unified framework that systematically reduces uncertainty across three critical dimensions: visual, distributional, and contextual. To mitigate visual uncertainty, a Frontal-Aware View Repair Resampling module corrects view annotation errors and adaptively selects informative features from supplementary views. To tackle label distribution uncertainty, we introduce a Token Sensitive Learning objective that enhances the modeling of critical diagnostic sentences. To reduce contextual uncertainty, our Contextual Evidence Filter validates and selectively incorporates prior information that aligns with the current image, effectively suppressing hallucinations.
arXiv Detail & Related papers (2025-08-03T09:52:30Z) - Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
Large language models (LLMs) are used in AI applications in healthcare. A red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses in four safety-critical domains. A suite of adversarial agents is applied to autonomously mutate test cases, identify and evolve unsafe-triggering strategies, and evaluate responses. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
arXiv Detail & Related papers (2025-07-30T08:44:22Z) - Metrics that matter: Evaluating image quality metrics for medical image generation [48.85783422900129]
This study comprehensively assesses commonly used no-reference image quality metrics using brain MRI data. We evaluate metric sensitivity to a range of challenges, including noise, distribution shifts, and, critically, morphological alterations designed to mimic clinically relevant inaccuracies.
arXiv Detail & Related papers (2025-05-12T01:57:25Z) - AI Alignment in Medical Imaging: Unveiling Hidden Biases Through Counterfactual Analysis [16.21270312974956]
We introduce a novel statistical framework to evaluate the dependency of medical imaging ML models on sensitive attributes, such as demographics. We present a practical algorithm that combines conditional latent diffusion models with statistical hypothesis testing to identify and quantify such biases.
arXiv Detail & Related papers (2025-04-28T09:28:25Z) - Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns. Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance. We propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z) - Revisiting Medical Image Retrieval via Knowledge Consolidation [46.6989555659494]
We propose a novel method to consolidate knowledge of hierarchical features and functions. We introduce Depth-aware Representation Fusion (DaRF) and Structure-aware Contrastive Hashing (SCH). Our method achieves a 5.6-38.9% improvement in mean Average Precision on the anatomical radiology dataset.
arXiv Detail & Related papers (2025-03-12T13:16:42Z) - Pathology-Aware Adaptive Watermarking for Text-Driven Medical Image Synthesis [14.661742509140995]
MedSign is a deep learning-based watermarking framework specifically designed for text-to-medical image synthesis. We generate a pathology localization map using cross-attention between medical text tokens and the diffusion denoising network. We optimize the LDM decoder to incorporate watermarking during image synthesis.
arXiv Detail & Related papers (2025-03-11T11:55:14Z) - Fact or Guesswork? Evaluating Large Language Models' Medical Knowledge with Structured One-Hop Judgments [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored. We introduce the Medical Knowledge Judgment dataset (MKJ), a dataset derived from the Unified Medical Language System (UMLS), a comprehensive repository of standardized vocabularies and knowledge graphs. Through a binary classification framework, MKJ evaluates LLMs' grasp of fundamental medical facts by having them assess the validity of concise, one-hop statements.
arXiv Detail & Related papers (2025-02-20T05:27:51Z) - Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding [72.18719355481052]
We introduce a novel task called Medical Report Grounding (MRG). MRG aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. We propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases.
arXiv Detail & Related papers (2024-04-10T07:41:35Z) - Preventing Unauthorized AI Over-Analysis by Medical Image Adversarial Watermarking [43.17275405041853]
We present a pioneering solution named Medical Image Adversarial watermarking (MIAD-MARK).
Our approach introduces watermarks that strategically mislead unauthorized AI diagnostic models, inducing erroneous predictions without compromising the integrity of the visual content.
Our solution effectively mitigates unauthorized exploitation of medical images even in the presence of sophisticated watermark removal networks.
arXiv Detail & Related papers (2023-03-17T09:37:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.