Related papers: Affective-ROPTester: Capability and Bias Analysis of LLMs in Predicting Retinopathy of Prematurity

Affective-ROPTester: Capability and Bias Analysis of LLMs in Predicting Retinopathy of Prematurity

URL: http://arxiv.org/abs/2507.05816v1
Date: Tue, 08 Jul 2025 09:36:14 GMT
Title: Affective-ROPTester: Capability and Bias Analysis of LLMs in Predicting Retinopathy of Prematurity
Authors: Shuai Zhao, Yulin Zhang, Luwei Xiao, Xinyi Wu, Yanhao Jia, Zhongliang Guo, Xiaobao Wu, Cong-Duy Nguyen, Guoming Zhang, Anh Tuan Luu,
Abstract summary: Large language models' capacity to predict retinopathy of prematurity (ROP) risk remains largely unexplored.<n>We introduce a novel Chinese benchmark dataset, termed CROP, comprising 993 admission records annotated with low, medium, and high-risk labels.<n>We propose Affective-ROPTester, an automated evaluation framework incorporating three prompting strategies.
Score: 34.80765908439636
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite the remarkable progress of large language models (LLMs) across various domains, their capacity to predict retinopathy of prematurity (ROP) risk remains largely unexplored. To address this gap, we introduce a novel Chinese benchmark dataset, termed CROP, comprising 993 admission records annotated with low, medium, and high-risk labels. To systematically examine the predictive capabilities and affective biases of LLMs in ROP risk stratification, we propose Affective-ROPTester, an automated evaluation framework incorporating three prompting strategies: Instruction-based, Chain-of-Thought (CoT), and In-Context Learning (ICL). The Instruction scheme assesses LLMs' intrinsic knowledge and associated biases, whereas the CoT and ICL schemes leverage external medical knowledge to enhance predictive accuracy. Crucially, we integrate emotional elements at the prompt level to investigate how different affective framings influence the model's ability to predict ROP and its bias patterns. Empirical results derived from the CROP dataset yield two principal observations. First, LLMs demonstrate limited efficacy in ROP risk prediction when operating solely on intrinsic knowledge, yet exhibit marked performance gains when augmented with structured external inputs. Second, affective biases are evident in the model outputs, with a consistent inclination toward overestimating medium- and high-risk cases. Third, compared to negative emotions, positive emotional framing contributes to mitigating predictive bias in model outputs. These findings highlight the critical role of affect-sensitive prompt engineering in enhancing diagnostic reliability and emphasize the utility of Affective-ROPTester as a framework for evaluating and mitigating affective bias in clinical language modeling systems.

Related papers

Reasoning Models Can be Easily Hacked by Fake Reasoning Bias [59.79548223686273]
We introduce THEATER, a comprehensive benchmark to evaluate Reasoning Theater Bias (RTB)<n>We investigate six bias types including Simple Cues and Fake Chain-of-Thought.<n>We identify'shallow reasoning'-plausible but flawed arguments-as the most potent form of RTB.
arXiv Detail & Related papers (2025-07-18T09:06:10Z)
LLM-Augmented Symptom Analysis for Cardiovascular Disease Risk Prediction: A Clinical NLP [2.2615384250361004]
This study introduces a novel LLM-augmented clinical NLP pipeline that employs domain-adapted large language models for symptom extraction, contextual reasoning, and correlation from free-text reports.<n> Evaluations on MIMIC-III and CARDIO-NLP datasets demonstrate improved performance in precision, recall, F1-score, and AUROC, with high clinical relevance.
arXiv Detail & Related papers (2025-07-15T07:32:16Z)
Statistical Learning for Heterogeneous Treatment Effects: Pretraining, Prognosis, and Prediction [40.96453902709292]
We propose pretraining strategies that leverage a phenomenon in real-world applications.<n>In medicine, components of the same biological signaling pathways frequently influence both baseline risk and treatment response.<n>We use this structure to incorporate side information and develop models that can exploit synergies between risk prediction and causal effect estimation.
arXiv Detail & Related papers (2025-05-01T05:12:14Z)
Assessing Judging Bias in Large Reasoning Models: An Empirical Study [99.86300466350013]
Large Reasoning Models (LRMs) like DeepSeek-R1 and OpenAI-o1 have demonstrated remarkable reasoning capabilities.<n>We present a benchmark comparing judging biases between LLMs and LRMs across both subjective preference-alignment datasets and objective fact-based datasets.
arXiv Detail & Related papers (2025-04-14T07:14:27Z)
Evaluation of the impact of expert knowledge: How decision support scores impact the effectiveness of automatic knowledge-driven feature engineering (aKDFE) [0.8272083537040182]
Adverse Drug Events (ADEs) pose significant healthcare challenges, impacting patient safety and costs.<n>This study evaluates automatic Knowledge-Driven Feature Engineering (aKDFE) for improved ADE prediction from Electronic Health Record (EHR) data.<n>We investigated how incorporating domain-specific ADE risk scores for prolonged heart QT interval affects prediction performance using EHR data and medication handling events.
arXiv Detail & Related papers (2025-04-08T11:34:38Z)
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.<n>We propose a framework encompassing three critical examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.<n>Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking, etc.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
Fragility-aware Classification for Understanding Risk and Improving Generalization [6.926253982569273]
We introduce the Fragility Index (FI), a novel metric that evaluates classification performance from a risk-averse perspective.<n>We derive exact reformulations for cross-entropy loss, hinge-type loss, and Lipschitz loss, and extend the approach to deep learning models.
arXiv Detail & Related papers (2025-02-18T16:44:03Z)
LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation [28.61326111959728]
Large language models (LLMs) exhibit significant biases in evaluation tasks, particularly in preferentially rating and favoring self-generated content. Our study addresses this knowledge gap by simulating two critical phases of the retrieval-augmented generation (RAG) framework. Contrary to previous findings, our results reveal no significant self-preference effect in RAG frameworks.
arXiv Detail & Related papers (2024-10-28T08:32:09Z)
Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress. Our investigation exposes a critical oversight in this belief. By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach [64.42462708687921]
Evaluations have revealed that factors such as scaling, training types, architectures and other factors profoundly impact the performance of LLMs. Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods. This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering technique.
arXiv Detail & Related papers (2024-03-22T14:47:35Z)
A standardized framework for risk-based assessment of treatment effect heterogeneity in observational healthcare databases [60.07352590494571]
The aim of this study was to extend this approach to the observational setting using a standardized scalable framework. We demonstrate our framework by evaluating the effect of angiotensin-converting enzyme (ACE) inhibitors versus beta blockers on three efficacy and six safety outcomes.
arXiv Detail & Related papers (2020-10-13T14:48:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.