CUGE: A Chinese Language Understanding and Generation Evaluation
Benchmark
- URL: http://arxiv.org/abs/2112.13610v1
- Date: Mon, 27 Dec 2021 11:08:58 GMT
- Title: CUGE: A Chinese Language Understanding and Generation Evaluation
Benchmark
- Authors: Yuan Yao, Qingxiu Dong, Jian Guan, Boxi Cao, Zhengyan Zhang, Chaojun
Xiao, Xiaozhi Wang, Fanchao Qi, Junwei Bao, Jinran Nie, Zheni Zeng, Yuxian
Gu, Kun Zhou, Xuancheng Huang, Wenhao Li, Shuhuai Ren, Jinliang Lu,
Chengqiang Xu, Huadong Wang, Guoyang Zeng, Zile Zhou, Jiajun Zhang, Juanzi
Li, Minlie Huang, Rui Yan, Xiaodong He, Xiaojun Wan, Xin Zhao, Xu Sun, Yang
Liu, Zhiyuan Liu, Xianpei Han, Erhong Yang, Zhifang Sui, Maosong Sun
- Abstract summary: General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark with the following features.
- Score: 144.05723617401674
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Realizing general-purpose language intelligence has been a longstanding goal
for natural language processing, where standard evaluation benchmarks play a
fundamental and guiding role. We argue that for general-purpose language
intelligence evaluation, the benchmark itself needs to be comprehensive and
systematic. To this end, we propose CUGE, a Chinese Language Understanding and
Generation Evaluation benchmark with the following features: (1) Hierarchical
benchmark framework, where datasets are principally selected and organized with
a language capability-task-dataset hierarchy. (2) Multi-level scoring strategy,
where different levels of model performance are provided based on the
hierarchical framework. To facilitate CUGE, we provide a public leaderboard
that can be customized to support flexible model judging criteria. Evaluation
results on representative pre-trained language models indicate ample room for
improvement towards general-purpose language intelligence. CUGE is publicly
available at cuge.baai.ac.cn.
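
To make the two features concrete, the following is a minimal sketch of how per-dataset scores could be rolled up through a language capability-task-dataset hierarchy into multi-level scores. The hierarchy entries, the normalization assumption, and the equal-weight averaging are illustrative placeholders, not CUGE's official datasets or weighting scheme.

```python
# Minimal sketch of multi-level scoring over a capability-task-dataset
# hierarchy. The hierarchy below and the equal-weight averaging are
# illustrative assumptions, not CUGE's official datasets or weights.
from statistics import mean

# capability -> task -> list of dataset names (hypothetical placeholders)
HIERARCHY = {
    "language understanding": {
        "reading comprehension": ["dataset_rc_1", "dataset_rc_2"],
        "text classification": ["dataset_cls_1"],
    },
    "language generation": {
        "summarization": ["dataset_sum_1"],
        "data-to-text": ["dataset_d2t_1"],
    },
}

def multi_level_scores(dataset_scores: dict[str, float]) -> dict:
    """Aggregate per-dataset scores (assumed already normalized, e.g. 0-100)
    into task-level, capability-level, and overall scores."""
    report = {"capabilities": {}, "overall": 0.0}
    capability_scores = []
    for capability, tasks in HIERARCHY.items():
        task_scores = {
            task: mean(dataset_scores[d] for d in datasets)
            for task, datasets in tasks.items()
        }
        cap_score = mean(task_scores.values())
        report["capabilities"][capability] = {"tasks": task_scores, "score": cap_score}
        capability_scores.append(cap_score)
    report["overall"] = mean(capability_scores)
    return report

if __name__ == "__main__":
    toy_scores = {
        "dataset_rc_1": 72.0, "dataset_rc_2": 65.5,
        "dataset_cls_1": 80.1, "dataset_sum_1": 41.3, "dataset_d2t_1": 55.0,
    }
    from pprint import pprint
    pprint(multi_level_scores(toy_scores))
```

Reporting at every level of the hierarchy, rather than a single aggregate number, is what lets a customizable leaderboard expose capability-level strengths and weaknesses of a model.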
Related papers
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- Towards Objectively Benchmarking Social Intelligence for Language Agents at Action Level [23.833528781431884]
The Social Simulation Tasks in Sandbox (STSS) benchmark is a language-level benchmark for multi-agent simulation.
Our evaluative findings highlight that the STSS benchmark is challenging for state-of-the-art language agents.
arXiv Detail & Related papers (2024-04-08T09:25:32Z)
- NLPre: a revised approach towards language-centric benchmarking of Natural Language Preprocessing systems [2.141587359797428]
It is arduous to compare novel solutions to well-entrenched preprocessing toolkits that rely on rule-based morphological analysers or dictionaries.
Inspired by the GLUE benchmark, the proposed language-centric benchmarking system enables comprehensive ongoing evaluation of multiple NLPre tools.
The prototype application is configured for Polish and integrated with the thoroughly assembled NLPre-PL benchmark.
arXiv Detail & Related papers (2024-03-07T14:07:00Z)
- On Using Distribution-Based Compositionality Assessment to Evaluate Compositional Generalisation in Machine Translation [10.840893953881652]
It is important to develop benchmarks to assess compositional generalisation in real-world natural language tasks.
This is done by splitting the Europarl translation corpus into a training and a test set in such a way that the test set requires compositional generalisation capacity.
This fully-automated procedure for creating natural language compositionality benchmarks makes it simple and inexpensive to apply to other datasets and languages (a toy sketch of the divergence-based splitting idea follows this entry).
arXiv Detail & Related papers (2023-11-14T15:37:19Z)
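
The splitting criterion described in the entry above follows the distribution-based compositionality assessment idea: keep the distributions of atomic units similar between train and test while making the distributions of their combinations diverge. Below is a toy sketch of measuring that divergence for a candidate split; treating tokens as atoms, token bigrams as compounds, and the specific alpha values are simplifying assumptions, not the paper's exact procedure.

```python
# Toy sketch of a distribution-based compositionality check: keep the
# word ("atom") distributions of train and test similar while making the
# bigram ("compound") distributions diverge. Bigrams-as-compounds is a
# simplification of what the paper actually uses.
from collections import Counter

def _distribution(counts: Counter) -> dict:
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def chernoff_divergence(p: Counter, q: Counter, alpha: float) -> float:
    """1 - sum_x p(x)^alpha * q(x)^(1-alpha): a standard divergence
    between two discrete distributions."""
    dp, dq = _distribution(p), _distribution(q)
    coeff = sum(dp.get(x, 0.0) ** alpha * dq.get(x, 0.0) ** (1 - alpha)
                for x in set(dp) | set(dq))
    return 1.0 - coeff

def atom_and_compound_divergence(train: list[str], test: list[str]):
    """Return (atom divergence, compound divergence) for two corpora,
    with tokens as atoms and token bigrams as compounds."""
    def atoms(corpus):
        return Counter(tok for sent in corpus for tok in sent.split())
    def compounds(corpus):
        c = Counter()
        for sent in corpus:
            toks = sent.split()
            c.update(zip(toks, toks[1:]))
        return c
    # The alpha values (symmetric for atoms, asymmetric for compounds)
    # are assumptions borrowed from this line of work.
    return (chernoff_divergence(atoms(train), atoms(test), alpha=0.5),
            chernoff_divergence(compounds(train), compounds(test), alpha=0.1))

if __name__ == "__main__":
    train = ["the cat sees the dog", "a dog chases a cat"]
    test = ["the dog sees a cat", "the cat chases the dog"]
    print(atom_and_compound_divergence(train, test))
```

A split with low atom divergence but high compound divergence forces the test set to recombine familiar words in unfamiliar ways, which is what makes it a probe of compositional generalisation.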
- Advancing the Evaluation of Traditional Chinese Language Models: Towards a Comprehensive Benchmark Suite [17.764840326809797]
We propose a novel set of benchmarks that leverage existing English datasets and are tailored to evaluate language models in Traditional Chinese.
These benchmarks encompass a wide range of tasks, including contextual question-answering, summarization, classification, and table understanding.
In this paper, we evaluate the performance of GPT-3.5, Taiwan-LLaMa-v1.0, and Model 7-C, our proprietary model, on these benchmarks.
arXiv Detail & Related papers (2023-09-15T14:52:23Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Multilingual Conceptual Coverage in Text-to-Image Models [98.80343331645626]
"Conceptual Coverage Across Languages" (CoCo-CroLa) is a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns.
For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language.
arXiv Detail & Related papers (2023-06-02T17:59:09Z)
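
To make the population comparison described in the entry above concrete, here is a toy sketch that scores per-noun coverage as the mean pairwise cosine similarity between two sets of image embeddings, one for the source-language prompt and one for the translated prompt. The embedding source, the similarity measure, and the helper names are illustrative assumptions, not the CoCo-CroLa implementation.

```python
# Toy illustration of the population-comparison idea behind cross-lingual
# conceptual coverage: for each tangible noun, compare the image embeddings
# generated from the source-language prompt with those generated from the
# translated prompt. A real system would obtain embeddings from a
# text-to-image model plus an image encoder; here they are placeholders.
import numpy as np

def _normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def population_similarity(src_embs: np.ndarray, tgt_embs: np.ndarray) -> float:
    """Mean pairwise cosine similarity between two populations of
    image embeddings (shape: [num_images, dim])."""
    src, tgt = _normalize(src_embs), _normalize(tgt_embs)
    return float((src @ tgt.T).mean())

def conceptual_coverage(nouns: list[str], generate_embeddings) -> dict[str, float]:
    """Score each noun by how similar the target-language image population
    is to the source-language one."""
    return {
        noun: population_similarity(generate_embeddings(noun, "source"),
                                    generate_embeddings(noun, "target"))
        for noun in nouns
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    def fake_generate(noun, lang, num_images=8, dim=16):
        # stand-in for "generate images for this noun and embed them"
        return rng.normal(size=(num_images, dim))
    print(conceptual_coverage(["dog", "bridge", "guitar"], fake_generate))
```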
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- Compositional Generalization in Grounded Language Learning via Induced Model Sparsity [81.38804205212425]
We consider simple language-conditioned navigation problems in a grid world environment with disentangled observations.
We design an agent that encourages sparse correlations between words in the instruction and attributes of objects, composing them together to find the goal.
Our agent maintains a high level of performance on goals containing novel combinations of properties even when learning from a handful of demonstrations.
arXiv Detail & Related papers (2022-07-06T08:46:27Z)
- Beyond the Tip of the Iceberg: Assessing Coherence of Text Classifiers [0.05857406612420462]
Large-scale, pre-trained language models achieve human-level and superhuman accuracy on existing language understanding tasks.
We propose evaluating systems through a novel measure of prediction coherence.
arXiv Detail & Related papers (2021-09-10T15:04:23Z)