RuCoLA: Russian Corpus of Linguistic Acceptability
- URL: http://arxiv.org/abs/2210.12814v1
- Date: Sun, 23 Oct 2022 18:29:22 GMT
- Title: RuCoLA: Russian Corpus of Linguistic Acceptability
- Authors: Vladislav Mikhailov, Tatiana Shamardina, Max Ryabinin, Alena Pestova,
Ivan Smurov, Ekaterina Artemova
- Abstract summary: We introduce the Russian Corpus of Linguistic Acceptability (RuCoLA)
RuCoLA consists of $9.8$k in-domain sentences from linguistic publications and $3.6$k out-of-domain sentences produced by generative models.
We demonstrate that the most widely used language models still fall behind humans by a large margin, especially when detecting morphological and semantic errors.
- Score: 6.500438378175089
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Linguistic acceptability (LA) attracts the attention of the research
community due to its many uses, such as testing the grammatical knowledge of
language models and filtering implausible texts with acceptability classifiers.
However, the application scope of LA in languages other than English is limited
due to the lack of high-quality resources. To this end, we introduce the
Russian Corpus of Linguistic Acceptability (RuCoLA), built from the ground up
under the well-established binary LA approach. RuCoLA consists of $9.8$k
in-domain sentences from linguistic publications and $3.6$k out-of-domain
sentences produced by generative models. The out-of-domain set is created to
facilitate the practical use of acceptability for improving language
generation. Our paper describes the data collection protocol and presents a
fine-grained analysis of acceptability classification experiments with a range
of baseline approaches. In particular, we demonstrate that the most widely used
language models still fall behind humans by a large margin, especially when
detecting morphological and semantic errors. We release RuCoLA, the code of
experiments, and a public leaderboard (rucola-benchmark.com) to assess the
linguistic competence of language models for Russian.
Related papers
- RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs [2.9521383230206966]
This paper introduces the Russian Benchmark of Linguistic Minimal Pairs (RuBLiMP)
RuBLiMP includes 45k pairs of sentences that differ in grammaticality and isolate a morphological, syntactic, or semantic phenomenon.
We find that the widely used language models for Russian are sensitive to morphological and agreement-oriented contrasts but fall behind humans on phenomena requiring understanding of structural relations, negation, transitivity, and tense.
arXiv Detail & Related papers (2024-06-27T14:55:19Z) - JCoLA: Japanese Corpus of Linguistic Acceptability [3.6141428739228902]
We introduce JCoLA (Japanese Corpus of Linguistic Acceptability), which consists of 10,020 sentences annotated with binary acceptability judgments.
We then evaluate the syntactic knowledge of 9 different types of Japanese language models on JCoLA.
arXiv Detail & Related papers (2023-09-22T07:35:45Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - OCR Language Models with Custom Vocabularies [5.608846358903994]
This paper introduces an algorithm for efficiently generating and attaching a domain specific word based language model at run time to a general language model in an OCR system.
The paper also introduces a modified CTC beam search decoder which effectively allows hypotheses to remain in contention based on possible future completion of vocabulary words.
arXiv Detail & Related papers (2023-08-18T16:46:11Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language
Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - Are Large Language Models Robust Coreference Resolvers? [17.60248310475889]
We show that prompting for coreference can outperform current unsupervised coreference systems.
Further investigations reveal that instruction-tuned LMs generalize surprisingly well across domains, languages, and time periods.
arXiv Detail & Related papers (2023-05-23T19:38:28Z) - CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z) - Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z) - Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace language embedding.
XLP projects the word embeddings into language-specific semantic space, and then the projected embeddings will be fed into the Transformer model.
Experiments show that XLP can freely and significantly boost the model performance on extensive multilingual benchmark datasets.
arXiv Detail & Related papers (2021-02-16T18:47:10Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.