Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations
- URL: http://arxiv.org/abs/2303.18027v2
- Date: Wed, 5 Apr 2023 07:53:37 GMT
- Title: Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations
- Authors: Jungo Kasai, Yuhei Kasai, Keisuke Sakaguchi, Yutaro Yamada, Dragomir
Radev
- Abstract summary: We evaluate large language models (LLMs) on the Japanese national medical licensing examinations from the past five years, including the current year.
Our experiments show that GPT-4 outperforms ChatGPT and GPT-3 and passes all six years of the exams.
We release our benchmark as Igaku QA as well as all model outputs and exam metadata.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) gain popularity among speakers of diverse
languages, we believe that it is crucial to benchmark them to better understand
model behaviors, failures, and limitations in languages beyond English. In this
work, we evaluate LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national
medical licensing examinations from the past five years, including the current
year. Our team comprises native Japanese-speaking NLP researchers and a
practicing cardiologist based in Japan. Our experiments show that GPT-4
outperforms ChatGPT and GPT-3 and passes all six years of the exams,
highlighting LLMs' potential in a language that is typologically distant from
English. However, our evaluation also exposes critical limitations of the
current LLM APIs. First, LLMs sometimes select prohibited choices that should
be strictly avoided in medical practice in Japan, such as suggesting
euthanasia. Further, our analysis shows that the API costs are generally higher
and the maximum context size is smaller for Japanese because of the way
non-Latin scripts are currently tokenized in the pipeline. We release our
benchmark as Igaku QA as well as all model outputs and exam metadata. We hope
that our results and benchmark will spur progress on more diverse applications
of LLMs. Our benchmark is available at https://github.com/jungokasai/IgakuQA.
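To make the tokenization point concrete, below is a minimal sketch (ours, not from the paper) that counts tokens for an English sentence and a roughly equivalent Japanese one using the open-source tiktoken library. The cl100k_base encoding and the example sentences are assumptions for illustration, not the authors' setup.

    # Minimal sketch (assumption: OpenAI's cl100k_base BPE via the tiktoken
    # library) of why Japanese text costs more per character than English:
    # non-Latin scripts are split into more tokens, so the same content uses
    # more of the paid quota and more of the fixed context window.
    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by ChatGPT/GPT-4

    # Hypothetical example sentences with roughly the same meaning.
    english = "The patient presents with chest pain and shortness of breath."
    japanese = "患者は胸痛と息切れを訴えて来院した。"

    for label, text in [("English", english), ("Japanese", japanese)]:
        n_tokens = len(enc.encode(text))
        print(f"{label}: {len(text)} chars -> {n_tokens} tokens "
              f"({n_tokens / len(text):.2f} tokens/char)")

With this encoding, Japanese text typically lands near one token per character while English is closer to one token per four characters, which matches the direction of the cost and context-size gap the authors describe.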
Related papers
- One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks
We present the first study of Large Language Models' fairness and robustness to dialect in canonical reasoning tasks.
We hire AAVE speakers to rewrite seven popular benchmarks, such as HumanEval and GSM8K.
We find that, compared to Standardized English, almost all of these widely used models show significant brittleness and unfairness to queries in AAVE.
arXiv Detail & Related papers (2024-10-14T18:44:23Z)
- Development and bilingual evaluation of Japanese medical large language model within reasonably low computational resources
We present a medical adaptation of recent 7B models that can operate with low computational resources.
We find that fine-tuning an English-centric base model on a Japanese medical dataset improves scores in both languages.
arXiv Detail & Related papers (2024-09-18T08:07:37Z)
- 70B-parameter large language models in Japanese medical question-answering
We show that instruction tuning on a Japanese medical question-answering dataset significantly improves the ability of Japanese LLMs to solve Japanese medical licensing exams.
In particular, the Japanese-centric models improve more from instruction tuning than their English-centric counterparts.
arXiv Detail & Related papers (2024-06-21T06:04:10Z)
- Multiple Choice Questions and Large Language Models: A Case Study with Fictional Medical Data
We developed a fictional medical benchmark focused on a non-existent gland, the Glianorex.
This approach allowed us to isolate the knowledge of the LLM from its test-taking abilities.
We evaluated various open-source, proprietary, and domain-specific LLMs using these questions in a zero-shot setting.
arXiv Detail & Related papers (2024-06-04T15:08:56Z)
- MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering
Large Language Models (LLMs) have the potential to facilitate the development of Artificial Intelligence technology.
This paper presents MedExpQA, the first multilingual benchmark based on medical exams to evaluate LLMs in Medical Question Answering.
arXiv Detail & Related papers (2024-04-08T15:03:57Z)
- Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT
This paper explores the efficacy of large language models (LLMs) for Persian.
We present the first comprehensive benchmarking study of LLMs across diverse Persian language tasks.
arXiv Detail & Related papers (2024-04-03T02:12:29Z)
- OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models
We introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages.
For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs.
Specifically, the current version of OMGEval includes 5 languages (Zh, Ru, Fr, Es, Ar).
arXiv Detail & Related papers (2024-02-21T04:42:41Z)
- Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba).
We examine cross-lingual reranking with queries in English and passages in the African languages.
arXiv Detail & Related papers (2023-12-26T18:38:54Z)
- Native Language Identification with Large Language Models
We show that GPT models are proficient at NLI classification, with GPT-4 setting a new performance record of 91.7% on the benchmark TOEFL11 test set in a zero-shot setting.
We also show that unlike previous fully-supervised settings, LLMs can perform NLI without being limited to a set of known classes.
arXiv Detail & Related papers (2023-12-13T00:52:15Z)
- HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs
HuatuoGPT-II has shown state-of-the-art performance in the Chinese medical domain on a number of benchmarks.
It even outperforms proprietary models such as ChatGPT and GPT-4 in some aspects, especially in Traditional Chinese Medicine.
arXiv Detail & Related papers (2023-11-16T10:56:24Z)
- Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis
Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT).
This paper systematically investigates the advantages and challenges of LLMs for MMT.
We thoroughly evaluate eight popular LLMs, including ChatGPT and GPT-4.
arXiv Detail & Related papers (2023-04-10T15:51:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.