PromptCBLUE: A Chinese Prompt Tuning Benchmark for the Medical Domain
- URL: http://arxiv.org/abs/2310.14151v1
- Date: Sun, 22 Oct 2023 02:20:38 GMT
- Title: PromptCBLUE: A Chinese Prompt Tuning Benchmark for the Medical Domain
- Authors: Wei Zhu and Xiaoling Wang and Huanran Zheng and Mosha Chen and Buzhou Tang
- Abstract summary: We re-build the Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark into a large-scale prompt-tuning benchmark, PromptCBLUE.
Our benchmark is a suitable test-bed and an online platform for evaluating the multi-task capabilities of Chinese LLMs on a wide range of biomedical tasks.
- Score: 24.411904114158673
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Biomedical language understanding benchmarks are the driving forces for
artificial intelligence applications with large language model (LLM) back-ends.
However, most current benchmarks (a) are limited to English, which makes it
challenging to replicate for other languages many of the successes achieved in
English; (b) focus on knowledge probing of LLMs and neglect to evaluate how
LLMs apply this knowledge to perform a wide range of biomedical tasks; or (c)
have become publicly available corpora and have leaked into LLM pre-training
data. To facilitate research on medical LLMs, we re-build the Chinese
Biomedical Language Understanding Evaluation (CBLUE) benchmark into a
large-scale prompt-tuning benchmark, PromptCBLUE. Our benchmark is a suitable
test-bed and an online platform for evaluating the multi-task capabilities of
Chinese LLMs on a wide range of biomedical tasks, including medical entity
recognition, medical text classification, medical natural language inference,
medical dialogue understanding, and medical content/dialogue generation. To
establish evaluation on these tasks, we experiment with nine current Chinese
LLMs fine-tuned with different fine-tuning techniques and report the results.
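
The key step in such a re-build is recasting each CBLUE task instance as a natural-language prompt with a textual target, so that a single LLM can be tuned and scored uniformly across all tasks. Below is a minimal sketch of such a conversion for a NER-style sample; the template wording, entity types, and field names are illustrative assumptions, not the actual PromptCBLUE templates.

```python
# A minimal sketch of recasting a CBLUE-style task instance as a
# prompt-response pair for prompt tuning. The template wording, entity
# types, and field names are illustrative assumptions, not the actual
# PromptCBLUE templates.

def cblue_ner_to_prompt(text: str, entities: list[dict]) -> dict:
    """Convert a medical NER sample into an instruction-style example."""
    prompt = (
        "找出下面医学文本中的临床实体（疾病、药物、症状）：\n"
        f"{text}\n"
        "答："
    )
    # Serialize the gold entities as the target completion.
    response = "；".join(f"{e['type']}：{e['span']}" for e in entities)
    return {"prompt": prompt, "response": response}

if __name__ == "__main__":
    sample = {
        "text": "患者自述头痛三天，服用布洛芬后缓解。",
        "entities": [
            {"type": "症状", "span": "头痛"},
            {"type": "药物", "span": "布洛芬"},
        ],
    }
    example = cblue_ner_to_prompt(sample["text"], sample["entities"])
    print(example["prompt"])
    print(example["response"])
```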
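
The abstract does not name the fine-tuning techniques compared, but parameter-efficient methods such as LoRA are a common choice when tuning LLMs on a benchmark of this kind. A minimal sketch using Hugging Face's peft library follows; the base model and hyperparameters are assumptions for illustration, not the paper's configuration.

```python
# A sketch of one common parameter-efficient fine-tuning setup (LoRA via
# Hugging Face's peft library). The base model, target modules, and
# hyperparameters below are assumptions, not the paper's configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "THUDM/chatglm2-6b"  # hypothetical choice of Chinese LLM
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

lora_config = LoraConfig(
    r=8,                                 # low-rank adapter dimension
    lora_alpha=16,                       # scaling factor
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # module names vary per architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```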
Related papers
- MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models [55.215061531495984]
"MedBench" is a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLM.
MedBench assembles the largest evaluation dataset (300,901 questions), covering 43 clinical specialties.
It also implements dynamic evaluation mechanisms to prevent shortcut learning and answer memorization.
arXiv Detail & Related papers (2024-06-24T02:25:48Z) - MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge [4.8004472307210255]
Large language models (LLMs) have excelled across domains, delivering notable performance on medical evaluation benchmarks.
However, there still exists a significant gap between the reported performance and the practical effectiveness in real-world medical scenarios.
We develop a novel evaluation framework MultifacetEval to examine the degree and coverage of LLMs in encoding and mastering medical knowledge.
arXiv Detail & Related papers (2024-06-05T04:15:07Z) - MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering [8.110978727364397]
Large Language Models (LLMs) have the potential to facilitate the development of Artificial Intelligence technology.
This paper presents MedExpQA, the first multilingual benchmark based on medical exams to evaluate LLMs in Medical Question Answering.
arXiv Detail & Related papers (2024-04-08T15:03:57Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a Doctor, as the player, and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large
Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of several key components, including the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives.
arXiv Detail & Related papers (2023-12-20T07:01:49Z) - CMB: A Comprehensive Medical Benchmark in Chinese [67.69800156990952]
We propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese.
While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety.
We have evaluated several prominent large-scale LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain.
arXiv Detail & Related papers (2023-08-17T07:51:23Z) - Large Language Models Leverage External Knowledge to Extend Clinical
Insight Beyond Language Boundaries [48.48630043740588]
Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks.
We develop a novel in-context learning framework to enhance their performance.
arXiv Detail & Related papers (2023-05-17T12:31:26Z) - CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the experiments show that state-of-the-art neural models still perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)