Faithful Summarization of Consumer Health Queries: A Cross-Lingual Framework with LLMs
- URL: http://arxiv.org/abs/2511.10768v1
- Date: Thu, 13 Nov 2025 19:42:11 GMT
- Title: Faithful Summarization of Consumer Health Queries: A Cross-Lingual Framework with LLMs
- Authors: Ajwad Abrar, Nafisa Tabassum Oeshy, Prianka Maheru, Farzana Tabassum, Tareque Mohmud Chowdhury
- Abstract summary: We propose a framework that combines TextRank-based sentence extraction and medical named entity recognition. We fine-tuned the LLaMA-2-7B model on the MeQSum (English) and BanglaCHQ-Summ (Bangla) datasets. Human evaluation shows that over 80% of generated summaries preserve critical medical information.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Summarizing consumer health questions (CHQs) can ease communication in healthcare, but unfaithful summaries that misrepresent medical details pose serious risks. We propose a framework that combines TextRank-based sentence extraction and medical named entity recognition with large language models (LLMs) to enhance faithfulness in medical text summarization. In our experiments, we fine-tuned the LLaMA-2-7B model on the MeQSum (English) and BanglaCHQ-Summ (Bangla) datasets, achieving consistent improvements across quality (ROUGE, BERTScore, readability) and faithfulness (SummaC, AlignScore) metrics, and outperforming zero-shot baselines and prior systems. Human evaluation further shows that over 80% of generated summaries preserve critical medical information. These results highlight faithfulness as an essential dimension for reliable medical summarization and demonstrate the potential of our approach for safer deployment of LLMs in healthcare contexts.
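The abstract names TextRank-based sentence extraction as one component of the framework but gives no implementation details here. The following is a minimal sketch of generic TextRank sentence ranking, not the authors' code: the naive sentence splitter, the word-overlap similarity, and the names `split_sentences`, `similarity`, and `textrank_extract` are all assumptions made for illustration.

```python
import math
import re


def split_sentences(text):
    """Naive splitter on ., ?, ! (an illustrative assumption, not the paper's tokenizer)."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def similarity(a, b):
    """Shared-word count normalised by the log of each sentence's vocabulary size."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    return len(wa & wb) / denom if denom > 0 else 0.0


def textrank_extract(text, k=2, damping=0.85, iterations=50):
    """Return the k highest-ranked sentences, kept in their original order."""
    sentences = split_sentences(text)
    n = len(sentences)

    # Build a symmetric sentence-similarity graph.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weights[i][j] = weights[j][i] = similarity(sentences[i], sentences[j])
    out_sum = [sum(row) for row in weights]

    # Weighted PageRank by power iteration.
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]

    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Under these assumptions, a sentence that shares no words with the rest of the query (a closing "Thanks a lot.", for instance) receives only the baseline score and is dropped first. How the extracted sentences are combined with the recognized medical entities before reaching the LLM is not specified in this listing.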
Related papers
- Towards Reliable Medical LLMs: Benchmarking and Enhancing Confidence Estimation of Large Language Models in Medical Consultation [97.36081721024728]
We propose the first benchmark for assessing confidence in multi-turn interaction during realistic medical consultations. Our benchmark unifies three types of medical data for open-ended diagnostic generation. We present MedConf, an evidence-grounded linguistic self-assessment framework.
arXiv Detail & Related papers (2026-01-22T04:51:39Z)
- Enhancing the Medical Context-Awareness Ability of LLMs via Multifaceted Self-Refinement Learning [49.559151128219725]
Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks. However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness. We propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs' context-awareness along three key facets.
arXiv Detail & Related papers (2025-11-13T08:13:23Z)
- MedRepBench: A Comprehensive Benchmark for Medical Report Interpretation [2.3251933592942247]
We introduce MedRepBench, a comprehensive benchmark built from 1,900 de-identified real-world Chinese medical reports. The benchmark is designed primarily to evaluate end-to-end VLMs for structured medical report understanding. We also observe that the OCR+LLM pipeline, despite strong performance, suffers from layout-blindness and latency issues.
arXiv Detail & Related papers (2025-08-21T07:52:45Z)
- Large Language Models for Cancer Communication: Evaluating Linguistic Quality, Safety, and Accessibility in Generative AI [0.40744588528519854]
Effective communication about breast and cervical cancers remains a persistent health challenge. This study evaluates the capabilities and limitations of Large Language Models (LLMs) in generating accurate, safe, and accessible cancer-related information.
arXiv Detail & Related papers (2025-05-15T16:23:21Z)
- Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns. Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance. We propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z)
- Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
- A Comprehensive Survey on the Trustworthiness of Large Language Models in Healthcare [8.378348088931578]
The application of large language models (LLMs) in healthcare holds significant promise for enhancing clinical decision-making, medical research, and patient care. Their integration into real-world clinical settings raises critical concerns around trustworthiness, particularly around dimensions of truthfulness, privacy, safety, robustness, fairness, and explainability.
arXiv Detail & Related papers (2025-02-21T18:43:06Z)
- A Mixed-Methods Evaluation of LLM-Based Chatbots for Menopause [7.156867036177255]
The integration of Large Language Models (LLMs) into healthcare settings has gained significant attention. We examine the performance of publicly available LLM-based chatbots for menopause-related queries. Our findings highlight the promise and limitations of traditional evaluation metrics for sensitive health topics.
arXiv Detail & Related papers (2025-02-05T19:56:52Z)
- Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval [61.70489848327436]
KARE is a novel framework that integrates knowledge graph (KG) community-level retrieval with large language models (LLMs) reasoning. Extensive experiments demonstrate that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission predictions.
arXiv Detail & Related papers (2024-10-06T18:46:28Z)
- HealthQ: Unveiling Questioning Capabilities of LLM Chains in Healthcare Conversations [20.31796453890812]
HealthQ is a framework for evaluating the questioning capabilities of large language models (LLMs) in healthcare conversations. We integrate an LLM judge to evaluate generated questions across metrics such as specificity, relevance, and usefulness. We present the first systematic framework for assessing questioning capabilities in healthcare conversations, establish a model-agnostic evaluation methodology, and provide empirical evidence linking high-quality questions to improved patient information elicitation.
arXiv Detail & Related papers (2024-09-28T23:59:46Z)
- Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z)
- FaMeSumm: Investigating and Improving Faithfulness of Medical Summarization [20.7585913214759]
Current summarization models often produce unfaithful outputs for medical input text. FaMeSumm is a framework to improve faithfulness by fine-tuning pre-trained language models based on medical knowledge.
arXiv Detail & Related papers (2023-11-03T23:25:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.