SLING: Sino Linguistic Evaluation of Large Language Models
- URL: http://arxiv.org/abs/2210.11689v1
- Date: Fri, 21 Oct 2022 02:29:39 GMT
- Title: SLING: Sino Linguistic Evaluation of Large Language Models
- Authors: Yixiao Song, Kalpesh Krishna, Rajesh Bhatt and Mohit Iyyer
- Abstract summary: Sino LINGuistics (SLING) consists of 38K minimal sentence pairs in Mandarin Chinese grouped into 9 high-level linguistic phenomena.
We test 18 publicly available pretrained monolingual (e.g., BERT-base-zh) and multi-lingual (e.g., mT5, XLM) language models on SLING.
Our experiments show that the average accuracy for LMs is far below human performance (69.7% vs. 97.1%), while BERT-base-zh achieves the highest accuracy (84.8%) of all tested LMs, even much larger ones.
- Score: 34.42512869432145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To understand what kinds of linguistic knowledge are encoded by pretrained
Chinese language models (LMs), we introduce the benchmark of Sino LINGuistics
(SLING), which consists of 38K minimal sentence pairs in Mandarin Chinese
grouped into 9 high-level linguistic phenomena. Each pair demonstrates the
acceptability contrast of a specific syntactic or semantic phenomenon (e.g.,
The keys are lost vs. The keys is lost), and an LM should assign lower
perplexity to the acceptable sentence. In contrast to the CLiMP dataset (Xiang
et al., 2021), which also contains Chinese minimal pairs and was created by
translating the vocabulary of the English BLiMP dataset, the minimal pairs in
SLING are derived primarily by applying syntactic and lexical transformations
to naturally-occurring, linguist-annotated sentences from the Chinese Treebank
9.0, thus addressing severe issues in CLiMP's data generation process. We test
18 publicly available pretrained monolingual (e.g., BERT-base-zh, CPM) and
multi-lingual (e.g., mT5, XLM) language models on SLING. Our experiments show
that the average accuracy for LMs is far below human performance (69.7% vs.
97.1%), while BERT-base-zh achieves the highest accuracy (84.8%) of all tested
LMs, outperforming even much larger models. Additionally, we find that most LMs have a strong
gender and number (singular/plural) bias, and they perform better on local
phenomena than hierarchical ones.
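To make the evaluation protocol concrete, here is a minimal sketch of the minimal-pair scoring recipe the abstract describes: score each sentence by its length-normalized negative log-likelihood (whose exponential is perplexity) under a causal LM, and count a pair as correct when the acceptable sentence scores lower. The checkpoint name is a placeholder Chinese LM and the example pair is invented; this is not SLING's exact harness.

```python
# Minimal-pair scoring sketch: an LM "passes" a pair when it assigns
# lower perplexity to the acceptable sentence than to the unacceptable one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "uer/gpt2-chinese-cluecorpussmall"  # placeholder Chinese causal LM
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def avg_nll(sentence: str) -> float:
    """Average per-token negative log-likelihood; exp(avg_nll) is perplexity."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean cross-entropy
        # of its next-token predictions.
        return model(ids, labels=ids).loss.item()

def passes(acceptable: str, unacceptable: str) -> bool:
    return avg_nll(acceptable) < avg_nll(unacceptable)

# Invented agreement-style pair, analogous to "The keys are/is lost":
print(passes("钥匙丢了。", "钥匙丢了们。"))
```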
Related papers
- Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models [11.287933170894311]
We construct a specialized benchmark dataset aimed at error correction for Chinese ASR, with 724K hypothesis-transcription pairs.
We propose a method of Pinyin regularization for prompts, which involves transcribing the text hypotheses directly into Pinyin.
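As an illustration of that idea (not the paper's implementation), the sketch below builds a correction prompt that appends a Pinyin transcription of the ASR hypothesis, so homophone substitutions become visible to the LLM; the pypinyin library and the prompt wording are assumptions made for this example.

```python
# Sketch of a Pinyin-regularized correction prompt: the Pinyin line lets
# the LLM reason about homophone confusions in the hypothesis.
from pypinyin import lazy_pinyin

def build_prompt(hypothesis: str) -> str:
    pinyin = " ".join(lazy_pinyin(hypothesis))
    return (
        "Correct any recognition errors in the following Chinese ASR "
        "hypothesis, using the Pinyin to resolve homophones.\n"
        f"Hypothesis: {hypothesis}\n"
        f"Pinyin: {pinyin}\n"
        "Correction:"
    )

print(build_prompt("今天天汽真好"))  # 汽 is a homophone error for 气
```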
arXiv Detail & Related papers (2024-07-02T03:16:47Z)
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z)
- Native Language Identification with Large Language Models [60.80452362519818]
We show that GPT models are proficient at NLI classification, with GPT-4 setting a new performance record of 91.7% on the benchmark TOEFL11 test set in a zero-shot setting.
We also show that unlike previous fully-supervised settings, LLMs can perform NLI without being limited to a set of known classes.
arXiv Detail & Related papers (2023-12-13T00:52:15Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
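A sketch of how such a prompt could be assembled, with invented exemplars from high-resource languages standing in for the paper's synthetic ones:

```python
# Sketch of linguistically-diverse few-shot prompting for X -> English
# translation: exemplars from several high-resource languages precede the
# low-resource query. All strings here are illustrative.
exemplars = [
    ("French",  "Le chat dort.",  "The cat sleeps."),
    ("Spanish", "Bebo agua.",     "I drink water."),
    ("German",  "Es regnet oft.", "It often rains."),
]

def build_prompt(source_sentence: str) -> str:
    shots = "\n".join(
        f"{lang}: {src}\nEnglish: {tgt}" for lang, src, tgt in exemplars
    )
    return f"{shots}\nSource: {source_sentence}\nEnglish:"

print(build_prompt("Mvua inanyesha."))  # Swahili for "It is raining."
```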
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
- mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models [57.225289079198454]
We propose mPLM-Sim, a language similarity measure that induces the similarities across languages from mPLMs using multi-parallel corpora.
Our study shows that mPLM-Sim exhibits moderately high correlations with linguistic similarity measures, such as lexicostatistics, genealogical language family, and geographical sprachbund.
We further investigate whether mPLM-Sim is effective for zero-shot cross-lingual transfer by conducting experiments on both low-level syntactic tasks and high-level semantic tasks.
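A minimal sketch of the underlying idea, under assumed choices (an off-the-shelf multilingual encoder and last-layer mean pooling) rather than the paper's exact layer and corpus setup: embed multi-parallel sentences in two languages and average the pairwise cosine similarities.

```python
# Sketch of an mPLM-derived language-similarity score: mean-pool hidden
# states of parallel sentences and average the cosine similarities.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "xlm-roberta-base"  # any multilingual encoder works here
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL).eval()

def embed(sentences: list[str]) -> torch.Tensor:
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq, dim)
    mask = batch.attention_mask.unsqueeze(-1)      # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)    # masked mean pooling

def language_similarity(parallel_a: list[str], parallel_b: list[str]) -> float:
    a, b = embed(parallel_a), embed(parallel_b)
    return torch.cosine_similarity(a, b, dim=-1).mean().item()

# Tiny multi-parallel sample (the same two sentences in German and Dutch):
de = ["Der Hund schläft.", "Ich trinke Wasser."]
nl = ["De hond slaapt.", "Ik drink water."]
print(language_similarity(de, nl))
```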
arXiv Detail & Related papers (2023-05-23T04:44:26Z)
- Sort by Structure: Language Model Ranking as Dependency Probing [25.723591566201343]
Making an informed choice of pre-trained language model (LM) is critical for performance, yet evaluating many candidates is environmentally costly, so the problem remains widely underexplored.
We propose probing to rank LMs, specifically for parsing dependencies in a given language, by measuring the degree to which labeled trees are recoverable from an LM's contextualized embeddings.
Across 46 typologically and architecturally diverse LM-language pairs, our approach predicts the best LM choice 79% of the time while using orders of magnitude less compute than training a full parser.
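The sketch below shows one way such a probe can look, assuming a simple bilinear edge scorer over frozen contextual embeddings; the paper's probe architecture and ranking protocol may differ.

```python
# Sketch of dependency probing for LM ranking: fit a bilinear scorer that
# predicts each token's head from frozen embeddings, then rank LMs by how
# well the trees are recoverable (unlabeled attachment score, UAS).
import torch
import torch.nn.functional as F

def probe_uas(embeddings, heads, dim, epochs=50, lr=1e-2):
    """embeddings: list of (n_tokens, dim) float tensors, one per sentence;
    heads: list of per-token head indices (the root points to itself)."""
    W = (0.01 * torch.randn(dim, dim)).requires_grad_()
    opt = torch.optim.Adam([W], lr=lr)
    targets = [torch.tensor(h) for h in heads]
    for _ in range(epochs):
        for X, t in zip(embeddings, targets):
            scores = X @ W @ X.T  # scores[i, j]: evidence that j heads i
            loss = F.cross_entropy(scores, t)
            opt.zero_grad(); loss.backward(); opt.step()
    correct = total = 0
    for X, t in zip(embeddings, targets):  # use a held-out split in practice
        correct += ((X @ W @ X.T).argmax(dim=1) == t).sum().item()
        total += len(t)
    return correct / total  # higher UAS -> LM encodes more parse structure
```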
arXiv Detail & Related papers (2022-06-10T08:10:29Z)
- CLiMP: A Benchmark for Chinese Language Model Evaluation [17.13061722469761]
We introduce the corpus of Chinese linguistic minimal pairs (CLiMP).
CLiMP consists of sets of 1,000 minimal pairs (MPs) for 16 syntactic contrasts in Mandarin, covering 9 major Mandarin linguistic phenomena.
We evaluate 11 different LMs on CLiMP, covering n-grams, LSTMs, and Chinese BERT.
arXiv Detail & Related papers (2021-01-26T23:16:29Z)
- BLiMP: The Benchmark of Linguistic Minimal Pairs for English [23.2834990762859]
The Benchmark of Linguistic Minimal Pairs (shortened to BLiMP) is a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English.
BLiMP consists of 67 sub-datasets, each containing 1000 minimal pairs isolating specific contrasts in syntax, morphology, or semantics.
We find that state-of-the-art models identify morphological contrasts reliably, but they struggle with semantic restrictions on the distribution of quantifiers and negative polarity items.
arXiv Detail & Related papers (2019-12-02T05:42:41Z)