AC-EVAL: Evaluating Ancient Chinese Language Understanding in Large
Language Models
- URL: http://arxiv.org/abs/2403.06574v1
- Date: Mon, 11 Mar 2024 10:24:37 GMT
- Title: AC-EVAL: Evaluating Ancient Chinese Language Understanding in Large
Language Models
- Authors: Yuting Wei, Yuanxing Xu, Xinru Wei, Simin Yang, Yangfu Zhu, Yuqing Li,
Di Liu, Bin Wu
- Abstract summary: AC-EVAL is a benchmark designed to assess the advanced knowledge and reasoning capabilities of Large Language Models (LLMs) in the context of ancient Chinese.
The benchmark comprises 13 tasks, spanning historical facts, geography, social customs, art, philosophy, classical poetry and prose.
Our evaluation of top-performing LLMs, tailored for both English and Chinese, reveals a substantial potential for enhancing ancient text comprehension.
- Score: 15.490610582567543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the importance of ancient Chinese in capturing the essence of rich
historical and cultural heritage, the rapid advancements in Large Language
Models (LLMs) necessitate benchmarks that can effectively evaluate their
understanding of ancient contexts. To meet this need, we present AC-EVAL, an
innovative benchmark designed to assess the advanced knowledge and reasoning
capabilities of LLMs within the context of ancient Chinese. AC-EVAL is
structured across three levels of difficulty reflecting different facets of
language comprehension: general historical knowledge, short text understanding,
and long text comprehension. The benchmark comprises 13 tasks, spanning
historical facts, geography, social customs, art, philosophy, classical poetry
and prose, providing a comprehensive assessment framework. Our extensive
evaluation of top-performing LLMs, tailored for both English and Chinese,
reveals a substantial potential for enhancing ancient text comprehension. By
highlighting the strengths and weaknesses of LLMs, AC-EVAL aims to promote
their development and application in the realms of ancient Chinese
language education and scholarly research. The AC-EVAL data and evaluation code
are available at https://github.com/yuting-wei/AC-EVAL.
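Since the abstract points to released data and evaluation code, the following is a minimal sketch of how an LLM might be scored on an AC-EVAL-style task, assuming a multiple-choice format. The JSON field names, file path, prompt template, and `ask_model` callable are illustrative assumptions, not the repository's actual interface; consult https://github.com/yuting-wei/AC-EVAL for the official data format and evaluation script.

```python
# Minimal sketch of scoring a model on one multiple-choice task file, assuming a
# JSON format with "question", "choices" (letter -> text mapping), and "answer"
# fields. These names and the prompt template are illustrative assumptions.
import json
from pathlib import Path


def build_prompt(item: dict) -> str:
    """Format one question and its labeled options as a zero-shot prompt."""
    options = "\n".join(f"{label}. {text}" for label, text in item["choices"].items())
    return f"{item['question']}\n{options}\nAnswer with a single option letter (A/B/C/D):"


def evaluate_task(task_file: Path, ask_model) -> float:
    """Return accuracy of `ask_model` (a callable: prompt -> model output) on one task file."""
    items = json.loads(task_file.read_text(encoding="utf-8"))
    correct = 0
    for item in items:
        prediction = ask_model(build_prompt(item)).strip()[:1].upper()
        correct += prediction == item["answer"]
    return correct / len(items)


if __name__ == "__main__":
    # Placeholder model that always answers "A"; swap in a real LLM API call.
    print(evaluate_task(Path("data/short_text_understanding.json"), lambda p: "A"))
```

Averages over the three difficulty levels described in the abstract could then be obtained by running such a loop over each task file.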
Related papers
- Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models [9.761584874383873]
We present Edu-Values, the first Chinese education values evaluation benchmark designed to measure large language models' alignment ability.
We meticulously design and compile 1,418 questions, including multiple-choice, multi-modal question answering, subjective analysis, adversarial prompts, and questions on traditional Chinese culture.
Due to differences in educational culture, Chinese LLMs significantly outperform English LLMs, with Qwen 2 ranking first with a score of 81.37.
arXiv Detail & Related papers (2024-09-19T13:02:54Z) - Benchmarking Chinese Knowledge Rectification in Large Language Models [43.9841600678381]
This paper introduces a benchmark for rectifying Chinese knowledge in Large Language Models via knowledge editing.
We collect seven types of knowledge from various sources, including classical texts, idioms, and content from Baidu Tieba Ruozhiba.
Through the analysis of this dataset, we uncover the challenges faced by current LLMs in mastering Chinese.
arXiv Detail & Related papers (2024-09-09T17:11:51Z) - FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models [64.11333762954283]
This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs.
We present an extensive evaluation of 12 state-of-the-art LLMs using FoundaBench, employing both traditional assessment methods and our CircularEval protocol to mitigate potential biases in model responses (an illustrative sketch of circular evaluation appears at the end of this page).
Our results highlight the superior performance of models pre-trained on Chinese corpora, and reveal a significant disparity between models' reasoning and memory recall capabilities.
arXiv Detail & Related papers (2024-04-29T01:49:07Z) - Measuring Taiwanese Mandarin Language Understanding [24.581360653015423]
We present TMLU, a holistic evaluation suite tailored for assessing the advanced knowledge and reasoning capabilities of large language models (LLMs).
TMLU consists of an array of 37 subjects across social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle school to professional levels.
arXiv Detail & Related papers (2024-03-29T13:56:21Z) - LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and instruction following to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer.
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z) - Can Large Language Model Comprehend Ancient Chinese? A Preliminary Test
on ACLUE [23.598825660594926]
ACLUE is an evaluation benchmark designed to assess the capability of language models in comprehending ancient Chinese.
We observed a noticeable disparity in their performance between modern Chinese and ancient Chinese.
ChatGLM2 demonstrates the strongest performance, achieving an average score of 37.4%.
arXiv Detail & Related papers (2023-10-14T10:06:39Z) - Towards Verifiable Generation: A Benchmark for Knowledge-aware Language Model Attribution [48.86322922826514]
This paper defines a new task of Knowledge-aware Language Model Attribution (KaLMA).
First, we extend attribution source from unstructured texts to Knowledge Graph (KG), whose rich structures benefit both the attribution performance and working scenarios.
Second, we propose a new "Conscious Incompetence" setting that accounts for an incomplete knowledge repository.
Third, we propose a comprehensive automatic evaluation metric encompassing text quality, citation quality, and text citation alignment.
arXiv Detail & Related papers (2023-10-09T11:45:59Z) - KoLA: Carefully Benchmarking World Knowledge of Large Language Models [87.96683299084788]
We construct a Knowledge-oriented LLM Assessment benchmark (KoLA).
We mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks.
We use both Wikipedia, a corpus on which LLMs are prevalently pre-trained, and continuously collected emerging corpora to evaluate the capacity to handle unseen data and evolving knowledge.
arXiv Detail & Related papers (2023-06-15T17:20:46Z) - Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language
Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantics understanding ability of the Chinese PLMs with dictionary knowledge and structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z) - C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for
Foundation Models [58.42279750824907]
We present C-Eval, the first comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of foundation models in a Chinese context.
C-Eval comprises multiple-choice questions across four difficulty levels: middle school, high school, college, and professional.
We conduct a comprehensive evaluation of the most advanced LLMs on C-Eval, including both English- and Chinese-oriented models.
arXiv Detail & Related papers (2023-05-15T03:20:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
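As a side note on the CircularEval protocol mentioned in the FoundaBench entry above: circular evaluation protocols typically rotate the answer options of each multiple-choice question and count the question as correct only if the model answers every rotation correctly, which reduces option-position bias. The sketch below illustrates that general idea under an assumed question format and `ask_model` callable; it is not FoundaBench's actual implementation.

```python
# Illustrative sketch of a CircularEval-style check: rotate the options of a
# multiple-choice question and accept it only if the model is correct on every
# rotation. The prompt format and `ask_model` callable are assumptions.
from collections import deque


def circular_eval(question: str, options: list[str], answer_idx: int, ask_model) -> bool:
    """Return True only if the model selects the correct option under every rotation."""
    labels = "ABCD"[: len(options)]
    rotated = deque(options)
    correct_text = options[answer_idx]
    for _ in range(len(options)):
        prompt = (
            question
            + "\n"
            + "\n".join(f"{label}. {text}" for label, text in zip(labels, rotated))
            + "\nAnswer with a single letter:"
        )
        predicted = ask_model(prompt).strip()[:1].upper()
        if predicted not in labels or rotated[labels.index(predicted)] != correct_text:
            return False
        rotated.rotate(1)  # move every option to a new position for the next round
    return True
```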