CMMLU: Measuring massive multitask language understanding in Chinese
- URL: http://arxiv.org/abs/2306.09212v2
- Date: Wed, 17 Jan 2024 19:09:57 GMT
- Title: CMMLU: Measuring massive multitask language understanding in Chinese
- Authors: Haonan Li and Yixuan Zhang and Fajri Koto and Yifei Yang and Hai Zhao
and Yeyun Gong and Nan Duan and Timothy Baldwin
- Abstract summary: This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
- Score: 133.70911295934746
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As the capabilities of large language models (LLMs) continue to advance,
evaluating their performance becomes increasingly crucial and challenging. This
paper aims to bridge this gap by introducing CMMLU, a comprehensive Chinese
benchmark that covers various subjects, including natural science, social
sciences, engineering, and humanities. We conduct a thorough evaluation of 18
advanced multilingual- and Chinese-oriented LLMs, assessing their performance
across different subjects and settings. The results reveal that most existing
LLMs struggle to achieve an average accuracy of 50%, even when provided with
in-context examples and chain-of-thought prompts, whereas the random baseline
stands at 25%. This highlights significant room for improvement in LLMs.
Additionally, we conduct extensive experiments to identify factors impacting
the models' performance and propose directions for enhancing LLMs. CMMLU fills
the gap in evaluating the knowledge and reasoning capabilities of large
language models within the Chinese context.
Related papers
- Measuring Taiwanese Mandarin Language Understanding [24.581360653015423]
We present TMLU, a holistic evaluation suit tailored for assessing the advanced knowledge and reasoning capability in large language models (LLMs)
TMLU consists of an array of 37 subjects across social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle school to professional levels.
arXiv Detail & Related papers (2024-03-29T13:56:21Z) - Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations [34.07537926291133]
CHARM is the first benchmark for comprehensively and in-depth evaluating the commonsense reasoning ability of large language models (LLMs) in Chinese.
We evaluated 7 English and 12 Chinese-oriented LLMs on CHARM.
Some LLMs struggle with memorizing Chinese commonsense, affecting their reasoning ability, while others show differences in reasoning despite similar performance.
arXiv Detail & Related papers (2024-03-21T03:52:01Z) - Analyzing and Adapting Large Language Models for Few-Shot Multilingual
NLU: Are We There Yet? [82.02076369811402]
Supervised fine-tuning (SFT), supervised instruction tuning (SIT) and in-context learning (ICL) are three alternative, de facto standard approaches to few-shot learning.
We present an extensive and systematic comparison of the three approaches, testing them on 6 high- and low-resource languages, three different NLU tasks, and a myriad of language and domain setups.
Our observations show that supervised instruction tuning has the best trade-off between performance and resource requirements.
arXiv Detail & Related papers (2024-03-04T10:48:13Z) - FAC$^2$E: Better Understanding Large Language Model Capabilities by
Dissociating Language and Cognition [57.747888532651]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z) - CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z) - LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer.
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z) - On the (In)Effectiveness of Large Language Models for Chinese Text
Correction [44.32102000125604]
Large Language Models (LLMs) have amazed the entire Artificial Intelligence community.
This study focuses on Chinese Text Correction, a fundamental and challenging Chinese NLP task.
We empirically find that the LLMs currently have both amazing performance and unsatisfactory behavior for Chinese Text Correction.
arXiv Detail & Related papers (2023-07-18T06:48:52Z) - Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.