Can Large Language Model Comprehend Ancient Chinese? A Preliminary Test on ACLUE
- URL: http://arxiv.org/abs/2310.09550v1
- Date: Sat, 14 Oct 2023 10:06:39 GMT
- Title: Can Large Language Model Comprehend Ancient Chinese? A Preliminary Test on ACLUE
- Authors: Yixuan Zhang and Haonan Li
- Abstract summary: ACLUE is an evaluation benchmark designed to assess the capability of language models in comprehending ancient Chinese.
We observed a noticeable disparity in their performance between modern Chinese and ancient Chinese.
ChatGLM2 demonstrates the most remarkable performance, achieving an average score of 37.4%.
- Score: 23.598825660594926
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have showcased remarkable capabilities in understanding and generating language. However, their ability to comprehend ancient languages, particularly ancient Chinese, remains largely unexplored. To bridge this gap, we present ACLUE, an evaluation benchmark designed to assess the capability of language models in comprehending ancient Chinese. ACLUE consists of 15 tasks covering a range of skills, spanning phonetics, lexicon, syntax, semantics, inference, and knowledge. Through the evaluation of eight state-of-the-art LLMs, we observed a noticeable disparity in their performance between modern Chinese and ancient Chinese. Among the assessed models, ChatGLM2 demonstrates the most remarkable performance, achieving an average score of 37.4%. We have made our code and data publicly available.
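A score like the 37.4% average follows from the standard multiple-choice protocol: the model answers each question, accuracy is computed per task, and the 15 task accuracies are macro-averaged. Below is a minimal sketch of that loop; the aclue.json filename, its record layout, and the query_model stub are all assumptions for illustration, not the authors' released code.

```python
import json
from collections import defaultdict

def query_model(prompt: str) -> str:
    """Stub for the LLM under test; swap in a real API or local model call."""
    raise NotImplementedError

def evaluate(path: str = "aclue.json") -> float:
    """Per-task accuracy, macro-averaged across tasks, as a percentage."""
    with open(path, encoding="utf-8") as f:
        questions = json.load(f)  # assumed: [{"task", "question", "choices", "answer"}, ...]

    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        options = "\n".join(f"{label}. {text}"
                            for label, text in zip("ABCD", q["choices"]))
        prediction = query_model(f"{q['question']}\n{options}\nAnswer:")
        correct[q["task"]] += prediction.strip()[:1] == q["answer"]  # e.g. "B"
        total[q["task"]] += 1

    return 100 * sum(correct[t] / total[t] for t in total) / len(total)
```

Running evaluate once per model would yield the per-model averages the abstract compares.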
Related papers
- Benchmarking Chinese Knowledge Rectification in Large Language Models [43.9841600678381]
This paper introduces a benchmark for rectifying Chinese knowledge in Large Language Models via knowledge editing.
We collect seven types of knowledge from various sources, including classical texts, idioms, and content from Baidu Tieba Ruozhiba.
Through the analysis of this dataset, we uncover the challenges faced by current LLMs in mastering Chinese.
arXiv Detail & Related papers (2024-09-09T17:11:51Z)
- SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z)
- Measuring Taiwanese Mandarin Language Understanding [24.581360653015423]
We present TMLU, a holistic evaluation suite tailored for assessing the advanced knowledge and reasoning capabilities of large language models (LLMs).
TMLU consists of an array of 37 subjects across social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle school to professional levels.
arXiv Detail & Related papers (2024-03-29T13:56:21Z)
- AC-EVAL: Evaluating Ancient Chinese Language Understanding in Large Language Models [15.490610582567543]
AC-EVAL is a benchmark designed to assess the advanced knowledge and reasoning capabilities of Large Language Models (LLMs) in the context of ancient Chinese.
The benchmark comprises 13 tasks, spanning historical facts, geography, social customs, art, philosophy, classical poetry and prose.
Our evaluation of top-performing LLMs, tailored for both English and Chinese, reveals a substantial potential for enhancing ancient text comprehension.
arXiv Detail & Related papers (2024-03-11T10:24:37Z)
- CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z)
- LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and instruction following to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer; a minimal vocabulary-extension sketch follows this entry.
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
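Of the factors above, vocabulary extension is the most mechanical step: new target-language tokens are added to the tokenizer and the embedding matrix is grown to match before further pretraining. A minimal sketch with Hugging Face transformers, assuming a public LLaMA checkpoint and two placeholder tokens (this is not the paper's exact recipe):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the paper studies LLaMA-family models.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

# Hypothetical new target-language tokens; in practice these come from
# training a tokenizer on target-language text and merging vocabularies.
num_added = tokenizer.add_tokens(["你好", "世界"])

# Grow the embedding matrix so the new ids get (randomly initialised) rows;
# further pretraining on target-language text then learns useful values.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")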
- Establishing Vocabulary Tests as a Benchmark for Evaluating Large Language Models [2.7013338932521416]
We advocate for the revival of vocabulary tests as a valuable tool for assessing the performance of Large Language Models (LLMs).
We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge.
arXiv Detail & Related papers (2023-10-23T08:45:12Z)
- CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including the natural sciences, social sciences, engineering, and humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z)
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
- WYWEB: A NLP Evaluation Benchmark For Classical Chinese [10.138128038929237]
We introduce the WYWEB evaluation benchmark, which consists of nine NLP tasks in classical Chinese.
We evaluate existing pre-trained language models, all of which struggle with this benchmark.
arXiv Detail & Related papers (2023-05-23T15:15:11Z)
- X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages; a minimal cloze-probe sketch follows this entry.
arXiv Detail & Related papers (2020-10-13T05:29:56Z)
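A cloze-style probe blanks one slot in a factual statement and checks whether the LM ranks the correct entity highly. A minimal single-token sketch using a multilingual masked LM; the model choice and the example sentence are illustrative, and X-FACTR's real probes additionally handle multi-token entities across 23 languages:

```python
from transformers import pipeline

# Multilingual masked LM as the probed model (illustrative choice).
fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

# Example cloze probe (not from the X-FACTR data); mBERT uses [MASK].
for candidate in fill("The capital of France is [MASK]."):
    print(f"{candidate['token_str']:>10}  {candidate['score']:.3f}")
```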