CLUE: A Chinese Language Understanding Evaluation Benchmark
- URL: http://arxiv.org/abs/2004.05986v3
- Date: Thu, 5 Nov 2020 14:46:45 GMT
- Title: CLUE: A Chinese Language Understanding Evaluation Benchmark
- Authors: Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen
Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi,
Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina
Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu, Zhe Zhao, Qipeng
Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Kyle Richardson and Zhenzhong
Lan
- Abstract summary: We introduce the first large-scale Chinese Language Understanding Evaluation (CLUE) benchmark.
CLUE brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension.
We report scores using an exhaustive set of current state-of-the-art pre-trained Chinese models.
- Score: 41.86950255312653
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The advent of natural language understanding (NLU) benchmarks for English,
such as GLUE and SuperGLUE, allows new NLU models to be evaluated across a
diverse set of tasks. These comprehensive benchmarks have facilitated a broad
range of research and applications in natural language processing (NLP). The
problem, however, is that most such benchmarks are limited to English, which
has made it difficult to replicate many of the successes in English NLU for
other languages. To help remedy this issue, we introduce the first large-scale
Chinese Language Understanding Evaluation (CLUE) benchmark. CLUE is an
open-ended, community-driven project that brings together 9 tasks spanning
several well-established single-sentence/sentence-pair classification tasks, as
well as machine reading comprehension, all on original Chinese text. To
establish results on these tasks, we report scores using an exhaustive set of
current state-of-the-art pre-trained Chinese models (9 in total). We also
introduce a number of supplementary datasets and additional tools to help
facilitate further progress on Chinese NLU. Our benchmark is released at
https://www.CLUEbenchmarks.com
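For illustration, below is a minimal sketch of fine-tuning a pre-trained Chinese model on one of the CLUE classification tasks (TNEWS); the Hugging Face Hub dataset name "clue", its "tnews" config, and the "sentence"/"label" field names are assumptions about the public mirror of the benchmark, not details taken from the paper.

# Hedged sketch: fine-tune a Chinese pre-trained model on one CLUE task (TNEWS).
# The Hub dataset name "clue", the "tnews" config, and the "sentence"/"label"
# fields are assumed from the public mirror, not taken from the paper itself.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("clue", "tnews")        # short-text news classification
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

encoded = dataset.map(tokenize, batched=True)
num_labels = dataset["train"].features["label"].num_classes

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=num_labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clue-tnews",
                           per_device_train_batch_size=32,
                           num_train_epochs=3,
                           evaluation_strategy="epoch"),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
)
trainer.train()
print(trainer.evaluate())   # dev-set loss; pass compute_metrics for accuracy

The same recipe applies to the other single-sentence and sentence-pair classification tasks by switching the config name and the tokenized fields.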
Related papers
- Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models [52.00446751692225]
We present a novel and simple yet effective method called Dictionary Insertion Prompting (DIP).
When providing a non-English prompt, DIP looks up a word dictionary and inserts words' English counterparts into the prompt for LLMs.
It then enables better translation into English and better English reasoning steps, which leads to clearly better results.
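As a rough illustration of that mechanism, the sketch below annotates a Chinese prompt with English glosses before it is sent to an LLM; the toy dictionary and the bracketed-gloss format are illustrative assumptions, not the paper's actual implementation.

# Hedged sketch of Dictionary Insertion Prompting (DIP): look up words from a
# bilingual dictionary and insert their English counterparts into the prompt.
# The toy dictionary and the "(gloss)" annotation format are assumptions only.
ZH_EN_DICT = {"翻译": "translate", "你好": "hello", "世界": "world"}

def insert_dictionary_glosses(prompt: str, dictionary: dict[str, str]) -> str:
    """Append the English counterpart after every dictionary word in the prompt."""
    for word, gloss in dictionary.items():
        prompt = prompt.replace(word, f"{word} ({gloss})")
    return prompt

augmented = insert_dictionary_glosses("请翻译: 你好, 世界", ZH_EN_DICT)
print(augmented)   # 请翻译 (translate): 你好 (hello), 世界 (world)
# The augmented prompt, rather than the original, is then sent to the LLM.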
arXiv Detail & Related papers (2024-11-02T05:10:50Z)
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on language varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
- CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z)
- bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark [28.472036496534116]
bgGLUE is a benchmark for evaluating language models on Natural Language Understanding (NLU) tasks in Bulgarian.
We run the first systematic evaluation of pre-trained language models for Bulgarian, comparing and contrasting results across the nine tasks in the benchmark.
arXiv Detail & Related papers (2023-06-04T12:54:00Z)
- WYWEB: A NLP Evaluation Benchmark For Classical Chinese [10.138128038929237]
We introduce the WYWEB evaluation benchmark, which consists of nine NLP tasks in classical Chinese.
We evaluate existing pre-trained language models, all of which struggle with this benchmark.
arXiv Detail & Related papers (2023-05-23T15:15:11Z)
- This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish [5.8090623549313944]
We introduce LEPISZCZE, a new, comprehensive benchmark for Polish NLP.
We use five datasets from the Polish benchmark and add eight novel datasets.
We share the insights and experience gained while creating the benchmark for Polish, as a blueprint for designing similar benchmarks for other low-resource languages.
arXiv Detail & Related papers (2022-11-23T16:51:09Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation (IGLUE) benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark [8.158067688043554]
This work introduces the Chinese Few-shot Learning Evaluation Benchmark (FewCLUE), the first comprehensive few-shot evaluation benchmark in Chinese.
An unlabeled training set with up to 20,000 additional samples per task is provided, allowing researchers to explore better ways of using unlabeled samples.
Next, we implement a set of state-of-the-art few-shot learning methods, and compare their performance with fine-tuning and zero-shot learning schemes on the newly constructed FewCLUE benchmark.
arXiv Detail & Related papers (2021-07-15T17:51:25Z)
- ParsiNLU: A Suite of Language Understanding Challenges for Persian [23.26176232463948]
This work focuses on Persian, one of the world's widely spoken languages, for which few NLU datasets are available.
ParsiNLU is the first Persian benchmark that includes a range of high-level tasks.
arXiv Detail & Related papers (2020-12-11T06:31:42Z)
- FewJoint: A Few-shot Learning Benchmark for Joint Language Understanding [55.38905499274026]
Few-shot learning is one of the key future steps in machine learning.
FewJoint is a novel Few-Shot Learning benchmark for NLP.
arXiv Detail & Related papers (2020-09-17T08:17:12Z)