PERCS: Persona-Guided Controllable Biomedical Summarization Dataset
- URL: http://arxiv.org/abs/2512.03340v1
- Date: Wed, 03 Dec 2025 01:13:56 GMT
- Title: PERCS: Persona-Guided Controllable Biomedical Summarization Dataset
- Authors: Rohan Charudatt Salvi, Chirag Chawla, Dhruv Jain, Swapnil Panigrahi, Md Shad Akhtar, Shweta Yadav,
- Abstract summary: PERCS is a dataset of biomedical abstracts paired with summaries tailored to four personas. These personas represent different levels of medical literacy and information needs. Technical validation shows clear differences in readability, vocabulary, and content depth across personas.
- Score: 14.620290128367012
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic medical text simplification plays a key role in improving health literacy by making complex biomedical research accessible to diverse readers. However, most existing resources assume a single generic audience, overlooking the wide variation in medical literacy and information needs across user groups. To address this limitation, we introduce PERCS (Persona-guided Controllable Summarization), a dataset of biomedical abstracts paired with summaries tailored to four personas: Laypersons, Premedical Students, Non-medical Researchers, and Medical Experts. These personas represent different levels of medical literacy and information needs, emphasizing the need for targeted, audience-specific summarization. Each summary in PERCS was reviewed by physicians for factual accuracy and persona alignment using a detailed error taxonomy. Technical validation shows clear differences in readability, vocabulary, and content depth across personas. Along with describing the dataset, we benchmark four large language models on PERCS using automatic evaluation metrics that assess comprehensiveness, readability, and faithfulness, establishing baseline results for future research. The dataset, annotation guidelines, and evaluation materials are publicly available to support research on persona-specific communication and controllable biomedical summarization.
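The abstract's technical validation relies on readability differences across persona-targeted summaries. A common way to quantify this is the Flesch Reading Ease formula (206.835 − 1.015 × words/sentence − 84.6 × syllables/word), where higher scores indicate easier text. The sketch below is an illustrative implementation with a crude vowel-group syllable heuristic, not the paper's actual validation pipeline; the example sentences are invented for demonstration.

```python
import re

def count_syllables(word: str) -> int:
    """Approximate syllables by counting vowel groups (crude heuristic)."""
    word = word.lower()
    groups = re.findall(r"[aeiouy]+", word)
    n = len(groups)
    if word.endswith("e") and n > 1:  # drop a typical silent final 'e'
        n -= 1
    return max(n, 1)

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher score = easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

# Hypothetical lay vs. expert persona summaries of the same finding.
lay = "The drug helped people sleep better and feel less pain."
expert = ("Pharmacological intervention significantly ameliorated "
          "polysomnographic parameters and nociceptive thresholds.")

print(flesch_reading_ease(lay) > flesch_reading_ease(expert))
```

Under this metric, the layperson-targeted sentence scores much higher than the expert-targeted one, mirroring the kind of persona-level readability gap the dataset's validation reports.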
Related papers
- Large Language Models for Healthcare Text Classification: A Systematic Review [4.8342038441006805]
Large Language Models (LLMs) have fundamentally transformed approaches to Natural Language Processing (NLP). In healthcare, accurate and cost-efficient text classification is crucial, whether for clinical notes analysis, diagnosis coding, or any other task. Numerous studies have been conducted to leverage LLMs for automated healthcare text classification.
arXiv Detail & Related papers (2025-03-03T04:16:13Z) - A Survey of Medical Vision-and-Language Applications and Their Techniques [48.268198631277315]
Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data.
Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied.
We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics.
arXiv Detail & Related papers (2024-11-19T03:27:05Z) - Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding [72.18719355481052]
We introduce a novel task called Medical Report Grounding (MRG). MRG aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. We propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases.
arXiv Detail & Related papers (2024-04-10T07:41:35Z) - LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [85.19963303642427]
We propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images.
The model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics.
This enables us to train a Large Language and Vision Assistant for BioMedicine in less than 15 hours (with eight A100s).
arXiv Detail & Related papers (2023-06-01T16:50:07Z) - BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks [68.39821375903591]
Generalist AI holds the potential to address the limitations of task-specific models due to its versatility in interpreting different data types.
Here, we propose BiomedGPT, the first open-source and lightweight vision-language foundation model.
arXiv Detail & Related papers (2023-05-26T17:14:43Z) - Readability Controllable Biomedical Document Summarization [17.166794984161964]
We introduce a new task of readability controllable summarization for biomedical documents.
It aims to recognise users' readability demands and generate summaries that better suit their needs.
arXiv Detail & Related papers (2022-10-10T14:03:20Z) - Towards more patient friendly clinical notes through language models and ontologies [57.51898902864543]
We present a novel approach to automated medical text simplification based on word substitution and language modelling.
We use a new dataset of pairs of publicly available medical sentences and versions of them simplified by clinicians.
Our method based on a language model trained on medical forum data generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z) - Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z) - Automated Lay Language Summarization of Biomedical Scientific Reviews [16.01452242066412]
Health literacy has emerged as a crucial factor in making appropriate health decisions and ensuring treatment outcomes.
Medical jargon and the complex structure of professional language in this domain make health information especially hard to interpret.
This paper introduces the novel task of automated generation of lay language summaries of biomedical scientific reviews.
arXiv Detail & Related papers (2020-12-23T10:01:18Z) - Question-Driven Summarization of Answers to Consumer Health Questions [17.732729654047983]
We present the MEDIQA Answer Summarization dataset.
This dataset is the first summarization collection containing question-driven summaries of answers to consumer health questions.
We include results of baseline and state-of-the-art deep learning summarization models.
arXiv Detail & Related papers (2020-05-18T20:36:11Z)