Related papers: Plain language adaptations of biomedical text using LLMs: Comparision of evaluation metrics

Plain language adaptations of biomedical text using LLMs: Comparision of evaluation metrics

URL: http://arxiv.org/abs/2512.16530v1
Date: Thu, 18 Dec 2025 13:37:58 GMT
Title: Plain language adaptations of biomedical text using LLMs: Comparision of evaluation metrics
Authors: Primoz Kocbek, Leon Kopitar, Gregor Stiglic,
Abstract summary: This study investigated the application of Large Language Models (LLMs) for simplifying biomedical texts to enhance health literacy.<n>We developed and evaluated several approaches, specifically a baseline approach using a prompt template, a two AI agent approach, and a fine-tuning approach.
Score: 1.4984469763984425
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: This study investigated the application of Large Language Models (LLMs) for simplifying biomedical texts to enhance health literacy. Using a public dataset, which included plain language adaptations of biomedical abstracts, we developed and evaluated several approaches, specifically a baseline approach using a prompt template, a two AI agent approach, and a fine-tuning approach. We selected OpenAI gpt-4o and gpt-4o mini models as baselines for further research. We evaluated our approaches with quantitative metrics, such as Flesch-Kincaid grade level, SMOG Index, SARI, and BERTScore, G-Eval, as well as with qualitative metric, more precisely 5-point Likert scales for simplicity, accuracy, completeness, brevity. Results showed a superior performance of gpt-4o-mini and an underperformance of FT approaches. G-Eval, a LLM based quantitative metric, showed promising results, ranking the approaches similarly as the qualitative metric.

Related papers

EQ-5D Classification Using Biomedical Entity-Enriched Pre-trained Language Models and Multiple Instance Learning [0.42970700836450487]
In health economics, systematic literature reviews depend on the correct identification of publications that use the EQ-5D.<n>Manual screening of large volumes of scientific literature is time-consuming, error-prone, and inconsistent.<n>In this study, we investigate fine-tuning of general-purpose (BERT) and domain-specific (SciBERT, BioBERT) pre-trained language models.
arXiv Detail & Related papers (2026-01-30T20:10:34Z)
When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs [55.20230501807337]
We present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework.<n>We benchmark these techniques on 8 models from Llama, Qwen and Gemma families across 52 tasks from Natural Instructions dataset.
arXiv Detail & Related papers (2025-08-15T10:32:50Z)
Zero-Shot Document-Level Biomedical Relation Extraction via Scenario-based Prompt Design in Two-Stage with LLM [6.26004554105527]
We propose a novel approach to achieve the same results from unannotated full documents using general large language models (LLMs) with lower hardware and labor costs.<n>Our approach combines two major stages: named entity recognition (NER) and relation extraction (RE)<n>To enhance the effectiveness of prompt, we propose a five-part template structure and a scenario-based prompt design principles.
arXiv Detail & Related papers (2025-05-02T07:33:20Z)
UM_FHS at TREC 2024 PLABA: Exploration of Fine-tuning and AI agent approach for plain language adaptations of biomedical text [3.223303935767146]
This paper describes our submissions to the TREC 2024 PLABA track with the aim to simplify biomedical abstracts for a K8-level audience (13-14 years old students).<n>We tested three approaches using OpenAI's GPt-4o and GPt-4o-mini models: baseline prompt engineering, a two-AI agent approach, and fine-tuning.<n>Results indicated that the two-agent approach and baseline prompt engineering with GPt-4o-mini models show superior qualitative performance, while fine-tuned models excelled in accuracy and completeness but were less simple.
arXiv Detail & Related papers (2025-02-19T23:07:16Z)
Knowledge Hierarchy Guided Biological-Medical Dataset Distillation for Domain LLM Training [10.701353329227722]
We propose a framework that automates the distillation of high-quality textual training data from the extensive scientific literature.<n>Our approach self-evaluates and generates questions that are more closely aligned with the biomedical domain.<n>Our approach substantially improves question-answering tasks compared to pre-trained models from the life sciences domain.
arXiv Detail & Related papers (2025-01-25T07:20:44Z)
Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset. We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6. Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z)
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals. GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date. It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
Comparative Analysis of Open-Source Language Models in Summarizing Medical Text Data [5.443548415516227]
Large Language Models (LLMs) have demonstrated superior performance in question answering and summarization tasks on unstructured text data. We propose an evaluation approach to analyze the performance of open-source LLMs for medical summarization tasks.
arXiv Detail & Related papers (2024-05-25T16:16:22Z)
Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models. We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT. We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
BLESS: Benchmarking Large Language Models on Sentence Simplification [55.461555829492866]
We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS) We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting. Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
arXiv Detail & Related papers (2023-10-24T12:18:17Z)
Benchmarking large language models for biomedical natural language processing applications and recommendations [22.668383945059762]
Large Language Models (LLMs) have shown promise in general domains.<n>We compare their zero-shot, few-shot, and fine-tuning performance with traditional fine-tuning of BERT or BART models.<n>We find issues like missing information and hallucinations in LLM outputs.
arXiv Detail & Related papers (2023-05-10T13:40:06Z)
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs. We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human on summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.