Measuring Massive Multitask Chinese Understanding
- URL: http://arxiv.org/abs/2304.12986v2
- Date: Mon, 15 May 2023 16:41:08 GMT
- Title: Measuring Massive Multitask Chinese Understanding
- Authors: Hui Zeng
- Abstract summary: This test encompasses four major domains: medicine, law, psychology, and education.
The best-performing models in the zero-shot setting outperformed the worst-performing models by nearly 18.6 percentage points on average.
All models performed poorly in the legal domain, with the highest zero-shot accuracy reaching only 0.239.
- Score: 16.41629318344805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The development of large-scale Chinese language models is flourishing, yet
there is a lack of corresponding capability assessments. Therefore, we propose
a test to measure the multitask accuracy of large Chinese language models. This
test encompasses four major domains: medicine, law, psychology, and education,
with 15 subtasks in medicine and 8 subtasks in education. We found
that the best-performing models in the zero-shot setting outperformed the
worst-performing models by nearly 18.6 percentage points on average. Across the
four major domains, the highest average zero-shot accuracy among all models is
0.512. Among the subdomains, only GPT-3.5-turbo achieved a zero-shot accuracy
of 0.693 in clinical medicine, the highest of any model on any subtask. All
models performed poorly in the legal
domain, with the highest zero-shot accuracy reaching only 0.239. By
comprehensively evaluating the breadth and depth of knowledge across multiple
disciplines, this test can more accurately identify the shortcomings of the
models.
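
To make the evaluation protocol concrete, below is a minimal sketch of how zero-shot multitask accuracy of this kind is typically computed: each question is posed once with no in-context examples, the predicted choice is compared against the answer key, and accuracy is aggregated per domain and then macro-averaged. The multiple-choice format, the question fields, and the ask_model callable are illustrative assumptions, not the paper's released harness.

```python
from collections import defaultdict

def zero_shot_accuracy(questions, ask_model):
    """Score a model on a zero-shot multiple-choice benchmark.

    questions: iterable of dicts with keys 'domain', 'prompt',
               'choices' (list of 4 strings), and 'answer' ('A'-'D').
    ask_model: hypothetical callable mapping a prompt string to a
               single letter 'A'-'D'; plug in any model API here.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        # Zero-shot: the prompt holds only the question and its options,
        # with no worked examples prepended.
        prompt = q["prompt"] + "\n" + "\n".join(
            f"{label}. {text}" for label, text in zip("ABCD", q["choices"])
        )
        prediction = ask_model(prompt)
        total[q["domain"]] += 1
        correct[q["domain"]] += int(prediction == q["answer"])
    per_domain = {d: correct[d] / total[d] for d in total}
    # Macro-average over domains, comparable to the per-model averages
    # cited in the abstract (e.g. the best overall average of 0.512).
    return per_domain, sum(per_domain.values()) / len(per_domain)
```

Under a protocol like this, the 18.6-point spread reported above is the difference between the best and worst models' macro-averaged accuracies.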
Related papers
- A Comprehensive Evaluation of Large Language Models on Mental Illnesses [0.8458496687170665]
GPT-4 and Llama 3 exhibited superior performance in binary disorder detection, with accuracies reaching up to 85% on certain datasets.
Prompt engineering played a crucial role in enhancing model performance.
Despite promising results, our analysis identified several challenges, including variability in performance across datasets and the need for careful prompt engineering.
arXiv Detail & Related papers (2024-09-24T02:58:52Z)
- A Comprehensive Evaluation of Histopathology Foundation Models for Ovarian Cancer Subtype Classification [1.9499122087408571]
Histopathology foundation models show great promise across many tasks.
We report the most rigorous single-task validation of histopathology foundation models to date.
Histopathology foundation models offer a clear benefit to ovarian cancer subtyping.
arXiv Detail & Related papers (2024-05-16T11:21:02Z)
- Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks [17.40940406100025]
We introduce Meerkat, a new family of medical AI systems ranging from 7 to 70 billion parameters.
Our systems achieved remarkable accuracy across six medical benchmarks.
Meerkat-70B correctly diagnosed 21 out of 38 complex clinical cases, outperforming the human average of 13.8.
arXiv Detail & Related papers (2024-03-30T14:09:00Z)
- A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that using phonetic and graphic information reasonably is effective for Chinese Spelling Check.
Models are sensitive to the error distribution of the test set, which reflects the shortcomings of models.
The commonly used benchmark, SIGHAN, cannot reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z)
- Plex: Towards Reliability using Pretrained Large Model Extensions [69.13326436826227]
We develop ViT-Plex and T5-Plex, pretrained large model extensions for vision and language modalities, respectively.
Plex greatly improves the state-of-the-art across reliability tasks, and simplifies the traditional protocol.
We demonstrate scaling effects over model sizes up to 1B parameters and pretraining dataset sizes up to 4B examples.
arXiv Detail & Related papers (2022-07-15T11:39:37Z)
- Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets a new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z)
- AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744]
We present AmericasNLI, an extension of XNLI (Conneau et al.) to 10 indigenous languages of the Americas.
We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches.
We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
arXiv Detail & Related papers (2021-04-18T05:32:28Z)
- MultiCQA: Zero-Shot Transfer of Self-Supervised Text Matching Models on a Massive Scale [64.11709427403008]
We study the zero-shot transfer capabilities of text matching models on a massive scale, by self-supervised training on 140 source domains.
We show that all 140 models transfer surprisingly well, with the large majority of models substantially outperforming common IR baselines.
arXiv Detail & Related papers (2020-10-02T13:22:12Z)
- Measuring Massive Multitask Language Understanding [79.6985576698597]
The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
The largest GPT-3 model improves over random chance by almost 20 percentage points on average.
Models also have lopsided performance and frequently do not know when they are wrong.
arXiv Detail & Related papers (2020-09-07T17:59:25Z)