The Biased Oracle: Assessing LLMs' Understandability and Empathy in Medical Diagnoses
- URL: http://arxiv.org/abs/2511.00924v1
- Date: Sun, 02 Nov 2025 13:01:07 GMT
- Authors: Jianzhou Yao, Shunchang Liu, Guillaume Drui, Rikard Pettersson, Alessandro Blasimme, Sara Kijewski
- Abstract summary: We evaluate two leading large language models (LLMs) on medical diagnostic scenarios. The results indicate that LLMs adapt explanations to socio-demographic variables and patient conditions. However, they also generate overly complex content and display biased affective empathy, leading to uneven accessibility and support.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) show promise for supporting clinicians in diagnostic communication by generating explanations and guidance for patients. Yet their ability to produce outputs that are both understandable and empathetic remains uncertain. We evaluate two leading LLMs on medical diagnostic scenarios, assessing understandability using readability metrics as a proxy and empathy through LLM-as-a-Judge ratings compared to human evaluations. The results indicate that LLMs adapt explanations to socio-demographic variables and patient conditions. However, they also generate overly complex content and display biased affective empathy, leading to uneven accessibility and support. These patterns underscore the need for systematic calibration to ensure equitable patient communication. The code and data are released: https://github.com/Jeffateth/Biased_Oracle
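The abstract describes assessing understandability with readability metrics as a proxy, without naming the specific metrics here. As a minimal illustrative sketch, the widely used Flesch Reading Ease score can be computed as follows; the syllable counter below is a naive vowel-group heuristic, not the dictionary-based method a production readability library would use:

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count contiguous vowel groups; not dictionary-accurate.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    # Flesch Reading Ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    n_sent, n_words = max(1, len(sentences)), max(1, len(words))
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (syllables / n_words)

# Plain wording scores much higher (easier) than jargon-heavy wording.
simple = flesch_reading_ease("You have high blood sugar. We can treat it.")
complex_ = flesch_reading_ease(
    "The patient exhibits hyperglycemia necessitating pharmacological intervention."
)
```

Higher scores indicate easier text; patient-facing material is often targeted at roughly 60 or above, which makes a score like this a natural proxy for whether an LLM's explanation is accessible.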
Related papers
- A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine [59.78991974851707]
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. Most medical LLMs are trained on data from a single institution, which limits their generalizability and safety in heterogeneous systems. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications.
arXiv Detail & Related papers (2026-01-29T18:48:21Z) - From Generation to Collaboration: Using LLMs to Edit for Empathy in Healthcare [2.933252902952646]
This study investigates how large language models (LLMs) can function as empathy editors. Experimental results show that LLM-edited responses significantly increase perceived empathy. These findings suggest that using LLMs as editorial assistants, rather than autonomous generators, offers a safer, more effective pathway to empathetic and trustworthy AI-assisted healthcare communication.
arXiv Detail & Related papers (2026-01-22T00:56:33Z) - M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding [66.78251988482222]
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning. Current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. M3CoTBench aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare.
arXiv Detail & Related papers (2026-01-13T17:42:27Z) - Visualizing token importance for black-box language models [48.747801442240565]
We consider the problem of auditing black-box large language models (LLMs) to ensure they behave reliably when deployed in production settings. We propose Distribution-Based Sensitivity Analysis (DBSA) to evaluate the sensitivity of a language model's output to each input token.
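The DBSA abstract describes scoring the sensitivity of a model's output to each input token; its actual method compares output distributions. As a simplified, hypothetical sketch of the underlying idea, an occlusion-style variant drops each token in turn and measures the change in a black-box score (the `toy_score` function below is an invented stand-in for a real model, not part of DBSA):

```python
def token_sensitivity(tokens, score_fn):
    """Occlusion-style sensitivity: drop each token and measure the score change.

    A simplified stand-in for distribution-based analysis: tokens whose
    removal moves the score most are deemed most influential.
    """
    base = score_fn(tokens)
    sensitivities = []
    for i in range(len(tokens)):
        ablated = tokens[:i] + tokens[i + 1:]
        sensitivities.append(abs(base - score_fn(ablated)))
    return sensitivities

# Hypothetical black-box scorer: a toy keyword-weighted score.
WEIGHTS = {"pain": 2.0, "severe": 1.5, "mild": 0.5}

def toy_score(tokens):
    return sum(WEIGHTS.get(t, 0.1) for t in tokens)

tokens = ["patient", "reports", "severe", "pain"]
sens = token_sensitivity(tokens, toy_score)
# Clinically loaded tokens ("severe", "pain") dominate the sensitivities.
```

Against a real LLM, `score_fn` would wrap a model call and the comparison would be over output distributions rather than a scalar, but the per-token ablation loop is the same shape.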
arXiv Detail & Related papers (2025-12-12T14:01:43Z) - Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation [7.867257950096845]
Plain language summaries (PLSs) are essential for facilitating effective communication between clinicians and patients. Large language models (LLMs) have recently shown promise in automating PLS generation, but their effectiveness in supporting health information comprehension remains unclear. We conducted a large-scale crowdsourced evaluation of LLM-generated PLSs using Amazon Mechanical Turk with 150 participants. Our findings indicate that while LLMs can generate PLSs that appear indistinguishable from human-written ones in subjective evaluations, human-written PLSs lead to significantly better comprehension.
arXiv Detail & Related papers (2025-05-15T15:31:17Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - Fact or Guesswork? Evaluating Large Language Models' Medical Knowledge with Structured One-Hop Judgments [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored. We introduce the Medical Knowledge Judgment (MKJ) dataset, derived from the Unified Medical Language System (UMLS), a comprehensive repository of standardized vocabularies and knowledge graphs. Through a binary classification framework, MKJ evaluates LLMs' grasp of fundamental medical facts by having them assess the validity of concise, one-hop statements.
arXiv Detail & Related papers (2025-02-20T05:27:51Z) - Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning [25.068780967617485]
Large language models (LLMs) offer a potential solution but struggle in real-world clinical interactions. We propose Ask Patients with Patience (APP), a multi-turn LLM-based medical assistant designed for grounded reasoning, transparent diagnoses, and human-centric interaction. APP enhances communication by eliciting user symptoms through empathetic dialogue, significantly improving accessibility and user engagement.
arXiv Detail & Related papers (2025-02-11T00:13:52Z) - MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters [1.6135243915480502]
Large language models (LLMs) offer solutions by simplifying medical information. However, evaluating LLMs for safe and patient-friendly text generation is difficult due to the lack of standardized evaluation resources. MeDiSumQA is a dataset created from MIMIC-IV discharge summaries through an automated pipeline.
arXiv Detail & Related papers (2025-02-05T15:56:37Z) - Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges [18.56314471146199]
The large volume of notes associated with patients, combined with time constraints, renders manually identifying relevant evidence practically infeasible.
We propose and evaluate a zero-shot strategy for using LLMs to efficiently retrieve and summarize unstructured evidence in patient EHRs.
arXiv Detail & Related papers (2023-09-08T18:44:47Z) - Don't Ignore Dual Logic Ability of LLMs while Privatizing: A Data-Intensive Analysis in Medical Domain [19.46334739319516]
We study how the dual logic ability of LLMs is affected during the privatization process in the medical domain.
Our results indicate that incorporating general-domain dual logic data into LLMs not only enhances their dual logic ability but also improves their accuracy.
arXiv Detail & Related papers (2023-09-08T08:20:46Z) - SPeC: A Soft Prompt-Based Calibration on Performance Variability of Large Language Model in Clinical Notes Summarization [50.01382938451978]
We introduce a model-agnostic pipeline that employs soft prompts to diminish variance while preserving the advantages of prompt-based summarization.
Experimental findings indicate that our method not only bolsters performance but also effectively curbs variance for various language models.
arXiv Detail & Related papers (2023-03-23T04:47:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.