VERIRAG: Healthcare Claim Verification via Statistical Audit in Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2507.17948v1
- Date: Wed, 23 Jul 2025 21:32:50 GMT
- Title: VERIRAG: Healthcare Claim Verification via Statistical Audit in Retrieval-Augmented Generation
- Authors: Shubham Mohole, Hongjun Choi, Shusen Liu, Christine Klymko, Shashank Kushwaha, Derek Shi, Wesam Sakla, Sainyam Galhotra, Ruben Glatt,
- Abstract summary: VERIRAG is a framework that makes three notable contributions: (i) the Veritable, an 11-point checklist that evaluates each source for methodological rigor, including data integrity and statistical validity; (ii) a Hard-to-Vary (HV) Score, a quantitative aggregator that weights evidence by its quality and diversity; and (iii) a Dynamic Acceptance Threshold, which calibrates the required evidence based on how extraordinary a claim is.
- Score: 12.545868971471844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-augmented generation (RAG) systems are increasingly adopted in clinical decision support, yet they remain methodologically blind-they retrieve evidence but cannot vet its scientific quality. A paper claiming "Antioxidant proteins decreased after alloferon treatment" and a rigorous multi-laboratory replication study will be treated as equally credible, even if the former lacked scientific rigor or was even retracted. To address this challenge, we introduce VERIRAG, a framework that makes three notable contributions: (i) the Veritable, an 11-point checklist that evaluates each source for methodological rigor, including data integrity and statistical validity; (ii) a Hard-to-Vary (HV) Score, a quantitative aggregator that weights evidence by its quality and diversity; and (iii) a Dynamic Acceptance Threshold, which calibrates the required evidence based on how extraordinary a claim is. Across four datasets-comprising retracted, conflicting, comprehensive, and settled science corpora-the VERIRAG approach consistently outperforms all baselines, achieving absolute F1 scores ranging from 0.53 to 0.65, representing a 10 to 14 point improvement over the next-best method in each respective dataset. We will release all materials necessary for reproducing our results.
Related papers
- An Uncertainty-Aware Dynamic Decision Framework for Progressive Multi-Omics Integration in Classification Tasks [6.736267874971369]
We propose an uncertainty-aware, multi-view dynamic decision framework for omics data classification.<n>We employ a fusion strategy based on Dempster-Shafer theory to integrate heterogeneous modalities.<n>In three datasets, over 50% of cases achieved accurate classification using a single omics modality.
arXiv Detail & Related papers (2025-06-20T13:44:14Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.<n>We propose a framework encompassing three critical examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.<n>Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking, etc.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - Causal Lifting of Neural Representations: Zero-Shot Generalization for Causal Inferences [56.23412698865433]
We focus on Prediction-Powered Causal Inferences (PPCI)<n> PPCI estimates the treatment effect in a target experiment with unlabeled factual outcomes, retrievable zero-shot from a pre-trained model.<n>We validate our method on synthetic and real-world scientific data, offering solutions to instances not solvable by vanilla Empirical Risk Minimization.
arXiv Detail & Related papers (2025-02-10T10:52:17Z) - WithdrarXiv: A Large-Scale Dataset for Retraction Study [33.782357627001154]
We present WithdrarXiv, the first large-scale dataset of withdrawn papers from arXiv.<n>We develop a comprehensive taxonomy of retraction reasons, identifying 10 distinct categories ranging from critical errors to policy violations.<n>We demonstrate a simple yet highly accurate zero-shot automatic categorization of retraction reasons, achieving a weighted average F1-score of 0.96.
arXiv Detail & Related papers (2024-12-04T23:36:23Z) - Arges: Spatio-Temporal Transformer for Ulcerative Colitis Severity Assessment in Endoscopy Videos [2.0735422289416605]
Expert MES/UCEIS annotation is time-consuming and susceptible to inter-rater variability.
CNN-based weakly-supervised models with end-to-end (e2e) training lack generalization to new disease scores.
"Arges" is a deep learning framework that incorporates positional encoding to estimate disease severity scores in endoscopy.
arXiv Detail & Related papers (2024-10-01T09:23:14Z) - uMedSum: A Unified Framework for Advancing Medical Abstractive Summarization [23.173826980480936]
Current methods often sacrifice key information for faithfulness or introduce confabulations when prioritizing informativeness.
This paper presents a benchmark of six advanced abstractive summarization methods across three diverse datasets using five standardized metrics.
We propose uMedSum, a modular hybrid summarization framework that introduces novel approaches for sequential confabulation removal followed by key missing information addition.
arXiv Detail & Related papers (2024-08-22T03:08:49Z) - Generating Scientific Claims for Zero-Shot Scientific Fact Checking [54.62086027306609]
Automated scientific fact checking is difficult due to the complexity of scientific language and a lack of significant amounts of training data.
We propose scientific claim generation, the task of generating one or more atomic and verifiable claims from scientific sentences.
We also demonstrate its usefulness in zero-shot fact checking for biomedical claims.
arXiv Detail & Related papers (2022-03-24T11:29:20Z) - Reinforcement Learning with Heterogeneous Data: Estimation and Inference [84.72174994749305]
We introduce the K-Heterogeneous Markov Decision Process (K-Hetero MDP) to address sequential decision problems with population heterogeneity.
We propose the Auto-Clustered Policy Evaluation (ACPE) for estimating the value of a given policy, and the Auto-Clustered Policy Iteration (ACPI) for estimating the optimal policy in a given policy class.
We present simulations to support our theoretical findings, and we conduct an empirical study on the standard MIMIC-III dataset.
arXiv Detail & Related papers (2022-01-31T20:58:47Z) - A robust kernel machine regression towards biomarker selection in
multi-omics datasets of osteoporosis for drug discovery [2.2897244874280043]
We propose "robust kernel machine regression (RobMR)," to improve the robustness of statistical machine regression and the diversity of fictional data.
Experiments demonstrate that the proposed approach effectively identifies the inter-related risk factors of osteoporosis.
The proposed approach can be applied be to any disease model multi-omics datasets are available.
arXiv Detail & Related papers (2022-01-13T16:39:46Z) - UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced
Data [81.00385374948125]
We present UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD)
UNITE achieves up to 0.841 in F1 score for AD detection, up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines by up to $19%$ over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z) - A standardized framework for risk-based assessment of treatment effect
heterogeneity in observational healthcare databases [60.07352590494571]
The aim of this study was to extend this approach to the observational setting using a standardized scalable framework.
We demonstrate our framework by evaluating the effect of angiotensin-converting enzyme (ACE) inhibitors versus beta blockers on three efficacy and six safety outcomes.
arXiv Detail & Related papers (2020-10-13T14:48:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.