CHBench: A Chinese Dataset for Evaluating Health in Large Language Models
- URL: http://arxiv.org/abs/2409.15766v1
- Date: Tue, 24 Sep 2024 05:44:46 GMT
- Title: CHBench: A Chinese Dataset for Evaluating Health in Large Language Models
- Authors: Chenlu Guo, Nuo Xu, Yi Chang, Yuan Wu
- Abstract summary: We present CHBench, the first comprehensive Chinese Health-related Benchmark.
CHBench includes 6,493 entries related to mental health and 2,999 entries focused on physical health.
This dataset serves as a foundation for evaluating Chinese LLMs' capacity to comprehend and generate accurate health-related information.
- Score: 19.209493319541693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid development of large language models (LLMs), assessing their performance on health-related inquiries has become increasingly essential. It is critical that these models provide accurate and trustworthy health information, as their application in real-world contexts, where misinformation can have serious consequences for individuals seeking medical advice and support, depends on their reliability. In this work, we present CHBench, the first comprehensive Chinese Health-related Benchmark designed to evaluate LLMs' capabilities in understanding physical and mental health across diverse scenarios. CHBench includes 6,493 entries related to mental health and 2,999 entries focused on physical health, covering a broad spectrum of topics. This dataset serves as a foundation for evaluating Chinese LLMs' capacity to comprehend and generate accurate health-related information. Our extensive evaluations of four popular Chinese LLMs demonstrate that there remains considerable room for improvement in their understanding of health-related information. The code is available at https://github.com/TracyGuo2001/CHBench.
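As a rough illustration of the dataset composition stated in the abstract, the two subsets can be tallied as follows. Note that the dictionary layout below is a hypothetical sketch for illustration only, not the actual file format used in the CHBench repository; only the entry counts come from the abstract.

```python
# Hypothetical sketch of CHBench's composition as described in the abstract.
# The dict structure is an assumption for illustration; the entry counts
# (6,493 mental health, 2,999 physical health) are from the abstract.
chbench_splits = {
    "mental_health": 6493,    # entries related to mental health
    "physical_health": 2999,  # entries focused on physical health
}

total_entries = sum(chbench_splits.values())
print(total_entries)  # 9492 entries in total
```

For the actual data format and loading code, consult the linked repository.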
Related papers
- Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored.
Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities.
We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z) - Do LLMs Provide Consistent Answers to Health-Related Questions across Languages? [14.87110905165928]
We examine the consistency of responses provided by Large Language Models (LLMs) to health-related questions across English, German, Turkish, and Chinese.
We reveal significant inconsistencies in responses that could spread healthcare misinformation.
Our findings emphasize the need for improved cross-lingual alignment to ensure accurate and equitable healthcare information.
arXiv Detail & Related papers (2025-01-24T18:51:26Z) - Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
Its accompanying instruction-tuning collection, MedS-Ins, comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z) - HRDE: Retrieval-Augmented Large Language Models for Chinese Health Rumor Detection and Explainability [6.800433977880405]
This paper builds a dataset containing 1.12 million health-related rumors (HealthRCN) through web scraping of common health-related questions.
We propose retrieval-augmented large language models for Chinese health rumor detection and explainability (HRDE).
arXiv Detail & Related papers (2024-06-30T11:27:50Z) - Potential Renovation of Information Search Process with the Power of Large Language Model for Healthcare [0.0]
This paper explores the development of the Six Stages of Information Search Model and its enhancement through the application of the Large Language Model (LLM) powered Information Search Processes (ISP) in healthcare.
arXiv Detail & Related papers (2024-06-29T07:00:47Z) - MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models [55.215061531495984]
"MedBench" is a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLMs.
MedBench assembles the largest evaluation dataset to date (300,901 questions), covering 43 clinical specialties, and implements dynamic evaluation mechanisms to prevent shortcut learning and answer memorization.
arXiv Detail & Related papers (2024-06-24T02:25:48Z) - MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of four key components, including the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z) - A Survey of Large Language Models in Medicine: Progress, Application, and Challenge [85.09998659355038]
Large language models (LLMs) have received substantial attention due to their capabilities for understanding and generating human language.
This review aims to provide a detailed overview of the development and deployment of LLMs in medicine.
arXiv Detail & Related papers (2023-11-09T02:55:58Z) - Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries [31.82249599013959]
Large language models (LLMs) are transforming the ways the general public accesses and consumes information.
LLMs demonstrate impressive language understanding and generation proficiencies, but concerns regarding their safety remain paramount.
It remains unclear how these LLMs perform in the context of non-English languages.
arXiv Detail & Related papers (2023-10-19T20:02:40Z) - CMB: A Comprehensive Medical Benchmark in Chinese [67.69800156990952]
We propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese.
While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety.
We have evaluated several prominent large-scale LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain.
arXiv Detail & Related papers (2023-08-17T07:51:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.