A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models : Safety, Consensus, Objectivity, Reproducibility and Explainability
- URL: http://arxiv.org/abs/2407.07666v1
- Date: Wed, 10 Jul 2024 13:45:16 GMT
- Title: A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models : Safety, Consensus, Objectivity, Reproducibility and Explainability
- Authors: Ting Fang Tan, Kabilan Elangovan, Jasmine Ong, Nigam Shah, Joseph Sung, Tien Yin Wong, Lan Xue, Nan Liu, Haibo Wang, Chang Fu Kuo, Simon Chesterman, Zee Kin Yeong, Daniel SW Ting
- Abstract summary: We propose five key aspects for the evaluation of large language models (LLMs).
We suggest that S.C.O.R.E. may form the basis for an evaluation framework for future LLM-based models.
- Score: 5.924966178563408
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: A comprehensive qualitative evaluation framework for large language models (LLMs) in healthcare that expands beyond traditional accuracy and quantitative metrics is needed. We propose five key aspects for the evaluation of LLMs: Safety, Consensus, Objectivity, Reproducibility and Explainability (S.C.O.R.E.). We suggest that S.C.O.R.E. may form the basis of an evaluation framework for future LLM-based models that are safe, reliable, trustworthy, and ethical for healthcare and clinical applications.
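To make the five dimensions concrete, the following is a minimal, illustrative sketch (not taken from the paper) of how an S.C.O.R.E.-style rubric could be recorded for a single LLM response. The `ScoreRubric` class, the 1-5 ordinal scale, and the mean aggregate are assumptions for demonstration only; the paper proposes the dimensions qualitatively and does not prescribe a scoring scale.

```python
# Illustrative sketch only: one possible way to record an S.C.O.R.E.-style rating.
# The 1-5 scale, field comments, and aggregate are assumptions, not the paper's method.
from dataclasses import dataclass, asdict


@dataclass
class ScoreRubric:
    safety: int           # freedom from harmful or unsafe recommendations
    consensus: int        # agreement with clinical guidelines / expert consensus
    objectivity: int      # absence of bias in the response
    reproducibility: int  # consistency of output across repeated queries
    explainability: int   # transparency of the reasoning behind the answer

    def __post_init__(self) -> None:
        # Enforce the assumed 1-5 ordinal scale for every dimension.
        for name, value in asdict(self).items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be on the assumed 1-5 scale, got {value}")

    def summary(self) -> dict:
        """Return per-dimension ratings plus a simple mean (an assumed aggregate)."""
        ratings = asdict(self)
        ratings["mean"] = sum(ratings.values()) / len(ratings)
        return ratings


# Example usage with hypothetical ratings for one LLM answer:
if __name__ == "__main__":
    rubric = ScoreRubric(safety=5, consensus=4, objectivity=4,
                         reproducibility=3, explainability=4)
    print(rubric.summary())
```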
Related papers
- SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose SafeBench, a comprehensive framework designed for conducting safety evaluations of MLLMs.
Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol.
Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z)
- MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications [2.838746648891565]
We introduce MEDIC, a framework assessing Large Language Models (LLMs) across five critical dimensions of clinical competence.
We apply MEDIC to evaluate LLMs on medical question-answering, safety, summarization, note generation, and other tasks.
Results show performance disparities across model sizes and between baseline and medically fine-tuned models, with implications for model selection in applications requiring specific model strengths.
arXiv Detail & Related papers (2024-09-11T14:44:51Z)
- StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation [46.59416831869014]
We propose a novel evaluation framework referred to as StructEval.
Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a structured assessment across multiple cognitive levels and critical concepts.
Experiments on three widely-used benchmarks demonstrate that StructEval serves as a reliable tool for resisting the risk of data contamination.
arXiv Detail & Related papers (2024-08-06T16:28:30Z)
- SAFETY-J: Evaluating Safety with Critique [24.723999605458832]
We introduce SAFETY-J, a bilingual generative safety evaluator for English and Chinese with critique-based judgment.
We establish an automated meta-evaluation benchmark that objectively assesses the quality of critiques with minimal human intervention.
Our evaluations demonstrate that SAFETY-J provides more nuanced and accurate safety evaluations, thereby enhancing both critique quality and predictive reliability in complex content scenarios.
arXiv Detail & Related papers (2024-07-24T08:04:00Z)
- ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can conflate fluency with truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert scales.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z)
- A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review [11.28580626017631]
We highlight a notable need for a standardized and consistent human evaluation approach.
We propose a comprehensive and practical framework for human evaluation of large language models (LLMs).
This framework aims to improve the reliability, generalizability, and applicability of human evaluation of LLMs in different healthcare applications.
arXiv Detail & Related papers (2024-05-04T04:16:07Z)
- Towards Automatic Evaluation for LLMs' Clinical Capabilities: Metric, Data, and Algorithm [15.627870862369784]
Large language models (LLMs) are attracting increasing interest for improving clinical efficiency in medical diagnosis.
We propose an automatic evaluation paradigm tailored to assess the LLMs' capabilities in delivering clinical services.
arXiv Detail & Related papers (2024-03-25T06:17:54Z)
- CriticEval: Evaluating Large Language Model as Critic [110.29766259843453]
CriticEval is a novel benchmark designed to comprehensively and reliably evaluate the critique ability of Large Language Models.
To ensure comprehensiveness, CriticEval evaluates critique ability along four dimensions across nine diverse task scenarios.
To ensure reliability, a large number of critiques are annotated to serve as references.
arXiv Detail & Related papers (2024-02-21T12:38:59Z)
- SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [107.82336341926134]
SALAD-Bench is a safety benchmark specifically designed for evaluating Large Language Models (LLMs).
It transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.
arXiv Detail & Related papers (2024-02-07T17:33:54Z)
- CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility [62.74405775089802]
We present CValues, the first Chinese human values evaluation benchmark to measure the alignment ability of LLMs.
We have manually collected adversarial safety prompts across 10 scenarios and induced responsibility prompts from 8 domains.
Our findings suggest that while most Chinese LLMs perform well in terms of safety, there is considerable room for improvement in terms of responsibility.
arXiv Detail & Related papers (2023-07-19T01:22:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.