CLiMP: A Benchmark for Chinese Language Model Evaluation
- URL: http://arxiv.org/abs/2101.11131v1
- Date: Tue, 26 Jan 2021 23:16:29 GMT
- Title: CLiMP: A Benchmark for Chinese Language Model Evaluation
- Authors: Beilei Xiang, Changbing Yang, Yu Li, Alex Warstadt and Katharina Kann
- Abstract summary: We introduce the corpus of Chinese linguistic minimal pairs (CLiMP).
CLiMP consists of sets of 1,000 minimal pairs (MPs) for 16 syntactic contrasts in Mandarin, covering 9 major Mandarin linguistic phenomena.
We evaluate 11 different LMs on CLiMP, covering n-grams, LSTMs, and Chinese BERT.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Linguistically informed analyses of language models (LMs) contribute to the
understanding and improvement of these models. Here, we introduce the corpus of
Chinese linguistic minimal pairs (CLiMP), which can be used to investigate what
knowledge Chinese LMs acquire. CLiMP consists of sets of 1,000 minimal pairs
(MPs) for 16 syntactic contrasts in Mandarin, covering 9 major Mandarin
linguistic phenomena. The MPs are semi-automatically generated, and human
agreement with the labels in CLiMP is 95.8%. We evaluated 11 different LMs on
CLiMP, covering n-grams, LSTMs, and Chinese BERT. We find that classifier-noun
agreement and verb complement selection are the phenomena that models generally
perform best at. However, models struggle the most with the ba construction,
binding, and filler-gap dependencies. Overall, Chinese BERT achieves an 81.8%
average accuracy, while the performances of LSTMs and 5-grams are only
moderately above chance level.
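Benchmarks of this kind (like SLING, CLAMS, and BLiMP among the related papers below) share a forced-choice protocol: an LM is credited for a minimal pair when it assigns the grammatical sentence higher probability, and accuracy is the fraction of pairs scored correctly. Below is a minimal sketch of that protocol with a causal Chinese LM; the checkpoint name and example pair are illustrative assumptions, not the paper's models or data.

```python
# Sketch of minimal-pair evaluation: credit the LM when the grammatical
# sentence gets higher total log-probability than its ungrammatical twin.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "uer/gpt2-chinese-cluecorpussmall"  # assumption: any causal Chinese LM works here
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Total log-probability of a sentence under the causal LM."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean token NLL as .loss
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)  # un-average over predicted tokens

def evaluate_pair(good: str, bad: str) -> bool:
    """The LM passes the pair if the grammatical sentence is more probable."""
    return sentence_log_prob(good) > sentence_log_prob(bad)

# Invented classifier-noun agreement pair (not drawn from CLiMP itself):
print(evaluate_pair("我买了一本书。", "我买了一条书。"))
```

For masked LMs such as Chinese BERT, a left-to-right sentence probability is not defined, so scores are typically derived differently (e.g., via a pseudo-log-likelihood over masked tokens).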
Related papers
- Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model
We introduce CT-LLM, a 2B-parameter large language model (LLM).
Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data.
CT-LLM excels in Chinese language tasks and showcases its adeptness in English through supervised fine-tuning (SFT).
arXiv Detail & Related papers (2024-04-05T15:20:02Z)
- CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z)
- AlignBench: Benchmarking Chinese Alignment of Large Language Models
We introduce AlignBench, a comprehensive benchmark for evaluating Chinese Large Language Models' alignment.
We design a human-in-the-loop data curation pipeline containing eight main categories, 683 real-scenario rooted queries, and corresponding human-verified references.
For automatic evaluation, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge (Zheng et al., 2023) approach with Chain-of-Thought to generate explanations and final ratings.
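The rating step of such a judge can be sketched in a few lines; the dimensions, prompt wording, and rating scale below are illustrative assumptions rather than AlignBench's actual rubric.

```python
import re

# Assumed scoring dimensions; AlignBench's real rubric varies per task category.
DIMENSIONS = ["factual correctness", "user need fulfilment", "clarity", "safety"]

def build_judge_prompt(query: str, reference: str, answer: str) -> str:
    """Ask the judge LLM to reason per dimension, then emit one parseable rating."""
    dims = "; ".join(DIMENSIONS)
    return (
        "You are grading a Chinese LLM response against a reference.\n"
        f"Question: {query}\nReference answer: {reference}\nModel answer: {answer}\n"
        f"Reason step by step about each dimension ({dims}), then end with a "
        "final line of exactly the form 'Rating: <integer 1-10>'."
    )

def parse_rating(judge_output: str) -> int | None:
    """Extract the final numeric rating; None if the judge broke the format."""
    match = re.search(r"Rating:\s*(\d+)", judge_output)
    return int(match.group(1)) if match else None
```

A rule-calibrated setup would additionally pin the judge to explicit, reference-anchored scoring rules for each dimension rather than leaving the scale implicit.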
arXiv Detail & Related papers (2023-11-30T17:41:30Z)
- ChatGPT MT: Competitive for High- (but not Low-) Resource Languages
Large language models (LLMs) implicitly learn to perform a range of language tasks, including machine translation (MT).
We present the first experimental evidence for an expansive set of 204 languages, along with MT cost analysis.
Our analysis reveals that a language's resource level is the most important feature in determining ChatGPT's relative ability to translate it.
arXiv Detail & Related papers (2023-09-14T04:36:00Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts
Large language models (LLMs) are known to perform tasks effectively after observing only a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
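A minimal sketch of the prompting scheme the summary describes follows, assuming a plain text-completion interface; the exemplar sentences and template are invented for illustration, not taken from the paper.

```python
# Few-shot exemplars drawn from several high-resource languages, used to prompt
# translation of a low-resource source sentence into English.
EXEMPLARS = [  # (source language, source sentence, English translation) - invented
    ("French", "Le chat dort sur le canapé.", "The cat is sleeping on the sofa."),
    ("German", "Das Wetter ist heute schön.", "The weather is nice today."),
    ("Spanish", "Compré pan en el mercado.", "I bought bread at the market."),
]

def build_prompt(source_sentence: str) -> str:
    """Concatenate cross-lingual exemplars, then ask for an English translation."""
    lines = [f"{lang}: {src}\nEnglish: {tgt}\n" for lang, src, tgt in EXEMPLARS]
    lines.append(f"Sentence: {source_sentence}\nEnglish:")
    return "\n".join(lines)

print(build_prompt("Habari ya asubuhi."))  # e.g., a Swahili input
```

The key design point is that the exemplars need not come from the source language at all: the LLM's English-dominant abilities generalize the translation pattern across languages.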
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
- mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models
We propose mPLM-Sim, a language similarity measure that induces similarities across languages from multilingual pretrained language models (mPLMs) using multi-parallel corpora.
Our study shows that mPLM-Sim exhibits moderately high correlations with linguistic similarity measures, such as lexicostatistics, genealogical language family, and geographical sprachbund.
We further investigate whether mPLM-Sim is effective for zero-shot cross-lingual transfer by conducting experiments on both low-level syntactic tasks and high-level semantic tasks.
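The general recipe can be sketched as follows, assuming mean-pooled encoder embeddings over a multi-parallel sample; the checkpoint and pooling choice are assumptions, not necessarily mPLM-Sim's exact procedure.

```python
# Embed the same sentences in two languages with a multilingual encoder and
# measure how close the languages' mean representations are.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "bert-base-multilingual-cased"  # assumed mPLM; any multilingual encoder works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)
model.eval()

def embed(sentences: list[str]) -> np.ndarray:
    """Masked mean-pool each sentence's final hidden states, then average."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)  # mean over real tokens
    return pooled.mean(0).numpy()  # average over the parallel sentences

def language_similarity(parallel_a: list[str], parallel_b: list[str]) -> float:
    """Cosine similarity between two languages' mean embeddings."""
    a, b = embed(parallel_a), embed(parallel_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```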
arXiv Detail & Related papers (2023-05-23T04:44:26Z)
- Massively Multilingual Shallow Fusion with Large Language Models
We train a single multilingual language model (LM) for shallow fusion in multiple languages.
Compared to a dense LM of similar computation during inference, GLaM (a mixture-of-experts LM) reduces the word error rate (WER) of an English long-tail test set by 4.4% relative.
In a multilingual shallow fusion task, GLaM improves 41 out of 50 languages with an average relative WER reduction of 3.85%, and a maximum reduction of 10%.
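Shallow fusion itself is a simple log-linear interpolation applied at each decoding step; the sketch below shows the generic scoring rule with an assumed interpolation weight, not GLaM's actual configuration.

```python
# Shallow fusion: rank candidate tokens by log P_ASR + lam * log P_LM.
import numpy as np

def shallow_fusion_scores(asr_log_probs: np.ndarray,
                          lm_log_probs: np.ndarray,
                          lam: float = 0.3) -> np.ndarray:
    """Combine per-token log-probabilities from the ASR model and external LM."""
    return asr_log_probs + lam * lm_log_probs

# At one decoding step, the fused scores re-rank three candidate next tokens:
asr = np.log(np.array([0.6, 0.3, 0.1]))  # ASR distribution over candidates
lm = np.log(np.array([0.2, 0.7, 0.1]))   # external LM distribution
print(shallow_fusion_scores(asr, lm))    # beam search keeps the best-scoring ones
```

The weight lam is a tuned hyperparameter; setting it too high lets the LM override the acoustic evidence, too low and the fusion has no effect.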
arXiv Detail & Related papers (2023-02-17T14:46:38Z)
- SLING: Sino Linguistic Evaluation of Large Language Models
Sino LINGuistics (SLING) consists of 38K minimal sentence pairs in Mandarin Chinese grouped into 9 high-level linguistic phenomena.
We test 18 publicly available pretrained monolingual (e.g., BERT-base-zh) and multi-lingual (e.g., mT5, XLM) language models on SLING.
Our experiments show that the average accuracy for LMs is far below human performance (69.7% vs. 97.1%), while BERT-base-zh achieves the highest accuracy (84.8%) of all tested LMs, outperforming even much larger models.
arXiv Detail & Related papers (2022-10-21T02:29:39Z)
- Cross-Linguistic Syntactic Evaluation of Word Prediction Models
We investigate how neural word prediction models' ability to learn syntax varies by language.
CLAMS (Cross-Linguistic Assessment of Models on Syntax) includes subject-verb agreement challenge sets for English, French, German, Hebrew, and Russian.
We use CLAMS to evaluate LSTM language models as well as monolingual and multilingual BERT.
arXiv Detail & Related papers (2020-05-01T02:51:20Z)
- BLiMP: The Benchmark of Linguistic Minimal Pairs for English
The Benchmark of Linguistic Minimal Pairs (shortened to BLiMP) is a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English.
BLiMP consists of 67 sub-datasets, each containing 1,000 minimal pairs isolating specific contrasts in syntax, morphology, or semantics.
We find that state-of-the-art models identify morphological contrasts reliably, but they struggle with semantic restrictions on the distribution of quantifiers and negative polarity items.
arXiv Detail & Related papers (2019-12-02T05:42:41Z)