CHBench: A Chinese Dataset for Evaluating Health in Large Language Models
- URL: http://arxiv.org/abs/2409.15766v1
- Date: Tue, 24 Sep 2024 05:44:46 GMT
- Title: CHBench: A Chinese Dataset for Evaluating Health in Large Language Models
- Authors: Chenlu Guo, Nuo Xu, Yi Chang, Yuan Wu
- Abstract summary: We present CHBench, the first comprehensive Chinese Health-related Benchmark.
CHBench includes 6,493 entries related to mental health and 2,999 entries focused on physical health.
This dataset serves as a foundation for evaluating Chinese LLMs' capacity to comprehend and generate accurate health-related information.
- Score: 19.209493319541693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid development of large language models (LLMs), assessing their performance on health-related inquiries has become increasingly essential. It is critical that these models provide accurate and trustworthy health information, as their application in real-world contexts--where misinformation can have serious consequences for individuals seeking medical advice and support--depends on their reliability. In this work, we present CHBench, the first comprehensive Chinese Health-related Benchmark designed to evaluate LLMs' capabilities in understanding physical and mental health across diverse scenarios. CHBench includes 6,493 entries related to mental health and 2,999 entries focused on physical health, covering a broad spectrum of topics. This dataset serves as a foundation for evaluating Chinese LLMs' capacity to comprehend and generate accurate health-related information. Our extensive evaluations of four popular Chinese LLMs demonstrate that there remains considerable room for improvement in their understanding of health-related information. The code is available at https://github.com/TracyGuo2001/CHBench.
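Example usage (illustrative only): since the dataset and evaluation code are released at https://github.com/TracyGuo2001/CHBench, a typical workflow is to load the health-related entries and collect responses from the model under evaluation for downstream scoring. The sketch below assumes a JSONL layout with "question" and "category" fields, an assumed file name "mental_health.jsonl", and a placeholder ask_model callable; these names are illustrative assumptions, not the authors' actual pipeline, whose real data format and scoring scripts are documented in the repository.
```python
# A minimal usage sketch, NOT the authors' evaluation pipeline.
# Assumptions (for illustration only): a JSONL file named "mental_health.jsonl"
# whose records carry "question" and "category" fields, and a user-supplied
# ask_model callable wrapping whichever Chinese LLM API is being evaluated.
# See https://github.com/TracyGuo2001/CHBench for the actual data layout.
import json
from pathlib import Path
from typing import Callable, Dict, List


def load_entries(path: str) -> List[Dict]:
    """Read one JSON object per line (JSONL) into a list of dicts."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def collect_responses(entries: List[Dict],
                      ask_model: Callable[[str], str]) -> List[Dict]:
    """Query the model once per entry and keep the raw responses for scoring."""
    results = []
    for entry in entries:
        prompt = entry["question"]              # assumed field name
        results.append({
            "category": entry.get("category"),  # assumed field name
            "question": prompt,
            "response": ask_model(prompt),
        })
    return results


if __name__ == "__main__":
    data_file = Path("mental_health.jsonl")     # assumed file name
    if data_file.exists():
        demo = collect_responses(
            load_entries(str(data_file)),
            # Replace this stub with a real call to the model under evaluation
            # (e.g. an ERNIE Bot, Qwen, or ChatGLM chat endpoint).
            ask_model=lambda q: "(model response placeholder)",
        )
        print(f"Collected {len(demo)} responses for downstream quality scoring.")
    else:
        print("Download the CHBench data first; this sketch only assumes a JSONL layout.")
```
How responses are then scored, and which models were evaluated, is described in the paper and repository; the sketch only covers data loading and response collection.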
Related papers
- A Comprehensive Survey on the Trustworthiness of Large Language Models in Healthcare [5.765614539740084]
The application of large language models (LLMs) in healthcare has the potential to revolutionize clinical decision-making, medical research, and patient care.
As LLMs are increasingly integrated into healthcare systems, several critical challenges must be addressed to ensure their reliable and ethical deployment.
arXiv Detail & Related papers (2025-02-21T18:43:06Z) - Do LLMs Provide Consistent Answers to Health-Related Questions across Languages? [14.87110905165928]
We examine the consistency of responses provided by Large Language Models (LLMs) to health-related questions across English, German, Turkish, and Chinese.
We reveal significant inconsistencies in responses that could spread healthcare misinformation.
Our findings emphasize the need for improved cross-lingual alignment to ensure accurate and equitable healthcare information.
arXiv Detail & Related papers (2025-01-24T18:51:26Z) - Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine [41.71754418349046]
We propose five key principles for safe and trustworthy medical AI, along with ten specific aspects.
Under this comprehensive framework, we introduce a novel MedGuard benchmark with 1,000 expert-verified questions.
Our evaluation of 11 commonly used LLMs shows that the current language models, regardless of their safety alignment mechanisms, generally perform poorly on most of our benchmarks.
This study underscores a significant safety gap, highlighting the crucial need for human oversight and the implementation of AI safety guardrails.
arXiv Detail & Related papers (2024-11-20T06:34:32Z) - The Role of Language Models in Modern Healthcare: A Comprehensive Review [2.048226951354646]
The application of large language models (LLMs) in healthcare has gained significant attention.
This review examines the trajectory of language models from their early stages to the current state-of-the-art LLMs.
arXiv Detail & Related papers (2024-09-25T12:15:15Z) - Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
The paper also introduces MedS-Ins, which comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z) - HRDE: Retrieval-Augmented Large Language Models for Chinese Health Rumor Detection and Explainability [6.800433977880405]
This paper builds a dataset containing 1.12 million health-related rumors (HealthRCN) through web scraping of common health-related questions.
We propose retrieval-augmented large language models for Chinese health rumor detection and explainability (HRDE).
arXiv Detail & Related papers (2024-06-30T11:27:50Z) - Potential Renovation of Information Search Process with the Power of Large Language Model for Healthcare [0.0]
This paper explores the development of the Six Stages of Information Search Model and its enhancement through the application of Large Language Model (LLM)-powered Information Search Processes (ISP) in healthcare.
arXiv Detail & Related papers (2024-06-29T07:00:47Z) - MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models [55.215061531495984]
"MedBench" is a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLM.
MedBench assembles the largest evaluation dataset (300,901 questions) to cover 43 clinical specialties.
It also implements dynamic evaluation mechanisms to prevent shortcut learning and answer remembering.
arXiv Detail & Related papers (2024-06-24T02:25:48Z) - CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models [92.04812189642418]
We introduce CARES to evaluate the trustworthiness of Med-LVLMs across the medical domain.
It assesses trustworthiness along five dimensions: trustfulness, fairness, safety, privacy, and robustness.
arXiv Detail & Related papers (2024-06-10T04:07:09Z) - Large Language Model for Mental Health: A Systematic Review [2.9429776664692526]
Large language models (LLMs) have attracted significant attention for potential applications in digital health.
This systematic review focuses on their strengths and limitations in early screening, digital interventions, and clinical applications.
arXiv Detail & Related papers (2024-02-19T17:58:41Z) - MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of several key components, including the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, yielding several key findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z) - A Survey of Large Language Models in Medicine: Progress, Application, and Challenge [85.09998659355038]
Large language models (LLMs) have received substantial attention due to their capabilities for understanding and generating human language.
This review aims to provide a detailed overview of the development and deployment of LLMs in medicine.
arXiv Detail & Related papers (2023-11-09T02:55:58Z) - Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries [31.82249599013959]
Large language models (LLMs) are transforming the ways the general public accesses and consumes information.
LLMs demonstrate impressive language understanding and generation proficiencies, but concerns regarding their safety remain paramount.
It remains unclear how these LLMs perform in the context of non-English languages.
arXiv Detail & Related papers (2023-10-19T20:02:40Z) - SafetyBench: Evaluating the Safety of Large Language Models [54.878612385780805]
SafetyBench is a comprehensive benchmark for evaluating the safety of Large Language Models (LLMs).
It comprises 11,435 diverse multiple-choice questions spanning 7 distinct categories of safety concerns.
Our tests over 25 popular Chinese and English LLMs in both zero-shot and few-shot settings reveal a substantial performance advantage for GPT-4 over its counterparts.
arXiv Detail & Related papers (2023-09-13T15:56:50Z) - CMB: A Comprehensive Medical Benchmark in Chinese [67.69800156990952]
We propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese.
While traditional Chinese medicine is integral to this evaluation, it does not constitute the entirety of the benchmark.
We have evaluated several prominent large-scale LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain.
arXiv Detail & Related papers (2023-08-17T07:51:23Z) - A Review on Knowledge Graphs for Healthcare: Resources, Applications, and Promises [52.31710895034573]
This work provides the first comprehensive review of healthcare knowledge graphs (HKGs).
It summarizes the pipeline and key techniques for HKG construction, as well as the common utilization approaches.
At the application level, we delve into the successful integration of HKGs across various health domains.
arXiv Detail & Related papers (2023-06-07T21:51:56Z) - Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes.
To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts, which includes 100k augmented prompts and responses generated by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.