Automatic Evaluation of Healthcare LLMs Beyond Question-Answering
- URL: http://arxiv.org/abs/2502.06666v1
- Date: Mon, 10 Feb 2025 16:52:39 GMT
- Title: Automatic Evaluation of Healthcare LLMs Beyond Question-Answering
- Authors: Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, Dario Garcia-Gasulla
- Abstract summary: We explore correlations between open and closed benchmarks and metrics.
As an updated sanity check, we release a new medical benchmark, CareQA, with both open and closed variants.
We propose a novel metric for open-ended evaluations, Relaxed Perplexity, to mitigate the identified limitations.
- Score: 0.42131793931438133
- License:
- Abstract: Current Large Language Model (LLM) benchmarks are often based on open-ended or closed-ended QA evaluations, avoiding the requirement of human labor. Closed-ended measurements evaluate the factuality of responses but lack expressiveness. Open-ended evaluations capture the model's capacity to produce discourse responses but are harder to assess for correctness. These two approaches are commonly used, either independently or together, though their relationship remains poorly understood. This work focuses on the healthcare domain, where both factuality and discourse matter greatly. It introduces a comprehensive, multi-axis suite for healthcare LLM evaluation, exploring correlations between open and closed benchmarks and metrics. Findings include blind spots and overlaps in current methodologies. As an updated sanity check, we release a new medical benchmark, CareQA, with both open and closed variants. Finally, we propose a novel metric for open-ended evaluations, Relaxed Perplexity, to mitigate the identified limitations.
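The exact definition of Relaxed Perplexity is given in the paper and is not reproduced here. As background, the hedged sketch below computes the standard conditional perplexity of a reference answer under a causal LM with Hugging Face transformers, which is the quantity that perplexity-style open-ended metrics build on; the model name and the QA pair are placeholders.

```python
# Hedged sketch: standard perplexity of a reference answer given a prompt,
# using Hugging Face transformers. Relaxed Perplexity (the paper's metric)
# builds on this idea; its exact definition is in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def answer_perplexity(prompt: str, answer: str) -> float:
    """Perplexity of `answer` conditioned on `prompt` (prompt tokens masked out).

    Assumes the tokenization of `prompt` is a prefix of the tokenization of
    `prompt + answer`, which holds for GPT-2-style tokenizers without added BOS.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore loss on the prompt tokens
    with torch.no_grad():
        loss = model(full_ids, labels=labels).loss  # mean NLL over answer tokens
    return torch.exp(loss).item()

print(answer_perplexity("Q: Which vitamin deficiency causes scurvy? A:", " Vitamin C."))
```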
Related papers
- LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment.
We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews.
Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
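The carefully designed cues are not quoted in this summary, so the snippet below is only a hypothetical illustration of what a zero-shot prompt for scoring a single MADRS item from a transcript could look like; the wording, the 0-6 anchor instruction, and the helper name are assumptions rather than the LlaMADRS template.

```python
# Hypothetical zero-shot prompt for scoring one MADRS item from a transcript.
# The item text and cueing strategy here are placeholders, not the cues
# designed in the LlaMADRS paper.
def build_madrs_item_prompt(item_name: str, item_description: str, transcript: str) -> str:
    return (
        "You are assisting with a structured depression severity assessment.\n"
        f"MADRS item: {item_name}\n"
        f"Item description: {item_description}\n"
        "Rate this item on a 0-6 scale based only on the interview transcript below.\n"
        "Respond with a single integer followed by a one-sentence justification.\n\n"
        f"Transcript:\n{transcript}\n\nRating:"
    )

prompt = build_madrs_item_prompt(
    "Apparent sadness",
    "Despondency, gloom and despair reflected in speech and expression.",
    "[transcribed clinical interview goes here]",
)
```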
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
- ACE-$M^3$: Automatic Capability Evaluator for Multimodal Medical Models [34.81544597731073]
We introduce ACE-$M^3$, an open-sourced Automatic Capability Evaluator for Multimodal Medical Models.
It first utilizes a branch-merge architecture to provide both detailed analysis and a concise final score based on standard medical evaluation criteria.
arXiv Detail & Related papers (2024-12-16T05:15:43Z)
- Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings [13.686732204665738]
We extend an existing BBQ dataset by incorporating fill-in-the-blank and short-answer question types.
Our findings reveal that LLMs produce responses that are more biased against certain protected attributes, such as age and socio-economic status.
Our debiasing approach, combining zero-shot, few-shot, and chain-of-thought prompting, can reduce the measured level of bias to nearly zero.
arXiv Detail & Related papers (2024-12-09T01:29:47Z)
- A Framework for Evaluating LLMs Under Task Indeterminacy [49.298107503257036]
Large language model (LLM) evaluations often assume there is a single correct response -- a gold label -- for each item in the evaluation corpus.
We develop a framework for evaluating LLMs under task indeterminacy.
arXiv Detail & Related papers (2024-11-21T00:15:44Z)
- A Benchmark for Long-Form Medical Question Answering [4.815957808858573]
There is a lack of benchmarks for evaluating large language models (LLMs) in long-form medical question answering (QA).
Most existing medical QA evaluation benchmarks focus on automatic metrics and multiple-choice questions.
In this work, we introduce a new publicly available benchmark featuring real-world consumer medical questions with long-form answer evaluations annotated by medical doctors.
arXiv Detail & Related papers (2024-11-14T22:54:38Z)
- LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs [61.57691505683534]
Non-Factoid (NF) Question Answering (QA) is challenging to evaluate because of the diversity of potential answers and the lack of an objective criterion.
Large Language Models (LLMs) have been adopted for NFQA evaluation owing to their compelling performance on various NLP tasks.
We propose a novel listwise NFQA evaluation approach that uses LLMs to rank a candidate answer within a list of reference answers sorted by descending quality, as sketched below.
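As a rough illustration of the listwise idea (not the paper's actual prompt template), the sketch below places a candidate answer against a quality-sorted reference list and asks a judge LLM for its rank; all wording and the helper name are assumptions.

```python
# Hedged sketch of a listwise NFQA evaluation prompt in the spirit of LINKAGE:
# the candidate answer is ranked against reference answers sorted best to worst.
# The prompt wording is an assumption, not the paper's template.
def build_listwise_prompt(question: str, references_best_to_worst: list[str], candidate: str) -> str:
    refs = "\n".join(f"{i + 1}. {ref}" for i, ref in enumerate(references_best_to_worst))
    return (
        f"Question: {question}\n\n"
        "Reference answers, sorted from highest to lowest quality:\n"
        f"{refs}\n\n"
        f"Candidate answer:\n{candidate}\n\n"
        "At which position (1 = better than all references, "
        f"{len(references_best_to_worst) + 1} = worse than all references) "
        "should the candidate be inserted? Answer with a single integer."
    )
```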
arXiv Detail & Related papers (2024-09-23T06:42:21Z)
- Uncertainty Estimation of Large Language Models in Medical Question Answering [60.72223137560633]
Large Language Models (LLMs) show promise for natural language generation in healthcare, but risk hallucinating factually incorrect information.
We benchmark popular uncertainty estimation (UE) methods with different model sizes on medical question-answering datasets.
Our results show that current approaches generally perform poorly in this domain, highlighting the challenge of UE for medical applications.
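The summary does not name the UE methods benchmarked; as a generic example of this family, the sketch below computes predictive entropy over answers sampled from an LLM, a common UE baseline. The sampling step is left as a placeholder and answers are bucketed by exact string match for simplicity.

```python
# Hedged sketch of a common UE baseline: predictive entropy over sampled answers.
# Sampling from the LLM is assumed to happen elsewhere; here answers are grouped
# by normalized exact match, a deliberate simplification.
from collections import Counter
import math

def predictive_entropy(answers: list[str]) -> float:
    """Entropy (in nats) of the empirical answer distribution; higher = more uncertain."""
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

samples = ["Metformin", "metformin", "Insulin", "Metformin"]  # e.g. 4 sampled answers
print(predictive_entropy(samples))  # low entropy -> the model is relatively confident
```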
arXiv Detail & Related papers (2024-07-11T16:51:33Z)
- Accurate and Nuanced Open-QA Evaluation Through Textual Entailment [4.762213968673381]
We propose to study the entailment relations of answers to identify more informative and more general system answers.
The entailment-based evaluation we propose allows the assignment of bonus or partial marks by quantifying the inference gap between answers.
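A minimal sketch of this kind of entailment-based scoring is shown below: an off-the-shelf NLI model tests entailment in both directions between the system answer and the gold answer, and full, partial, or no credit is awarded accordingly. The model choice, threshold, and mark values are assumptions, and the paper's exact bonus/partial-mark scheme may differ.

```python
# Hedged sketch: entailment-based partial credit for open-QA answers.
# Model choice, threshold, and mark values are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_name = "roberta-large-mnli"  # placeholder NLI model
tok = AutoTokenizer.from_pretrained(nli_name)
nli = AutoModelForSequenceClassification.from_pretrained(nli_name).eval()

def entails(premise: str, hypothesis: str) -> float:
    """Probability that `premise` entails `hypothesis`."""
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)[0]
    return probs[nli.config.label2id["ENTAILMENT"]].item()  # label name per this model's config

def entailment_score(system_answer: str, gold_answer: str, thr: float = 0.5) -> float:
    forward = entails(system_answer, gold_answer) > thr   # system answer implies the gold answer
    backward = entails(gold_answer, system_answer) > thr  # gold answer implies the system answer
    if forward and backward:
        return 1.0   # mutually entailing answers: full credit
    if forward or backward:
        return 0.5   # one-directional entailment: partial credit
    return 0.0
```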
arXiv Detail & Related papers (2024-05-26T21:33:27Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- OpenAUC: Towards AUC-Oriented Open-Set Recognition [151.5072746015253]
Traditional machine learning follows a closed-set assumption: the training and test sets share the same label space.
Open-Set Recognition (OSR) aims to make correct predictions on both closed-set and open-set samples.
To fix these issues, we propose a novel metric named OpenAUC.
arXiv Detail & Related papers (2022-10-22T08:54:15Z)
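One plausible reading of an AUC-oriented open-set metric is sketched below: over all (close-set, open-set) sample pairs, count the fraction in which a correctly classified close-set sample receives a lower "unknown" score than the open-set sample. This is an illustrative simplification, not necessarily the paper's exact OpenAUC definition.

```python
# Hedged sketch of an AUC-style open-set metric: the fraction of
# (close-set, open-set) pairs where a correctly classified close-set sample
# gets a lower "unknown" score than the open-set sample.
from itertools import product

def open_auc_like(close_correct: list[bool], close_unknown_scores: list[float],
                  open_unknown_scores: list[float]) -> float:
    pairs = list(product(range(len(close_unknown_scores)), range(len(open_unknown_scores))))
    good = sum(
        1 for i, j in pairs
        if close_correct[i] and close_unknown_scores[i] < open_unknown_scores[j]
    )
    return good / len(pairs)

# Toy example: 3 close-set samples (2 classified correctly) and 2 open-set samples.
print(open_auc_like([True, True, False], [0.1, 0.4, 0.2], [0.6, 0.3]))  # -> 0.5
```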
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.