RuCoCo: a new Russian corpus with coreference annotation
- URL: http://arxiv.org/abs/2206.04925v1
- Date: Fri, 10 Jun 2022 07:50:09 GMT
- Title: RuCoCo: a new Russian corpus with coreference annotation
- Authors: Vladimir Dobrovolskii, Mariia Michurina, Alexandra Ivoylova
- Abstract summary: We present a new corpus with coreference annotation, Russian Coreference Corpus (RuCoCo).
RuCoCo contains news texts in Russian; some were annotated from scratch, and for the rest, machine-generated annotations were refined by human annotators.
The size of our corpus is one million words and around 150,000 mentions.
- Score: 69.3939291118954
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a new corpus with coreference annotation, Russian Coreference
Corpus (RuCoCo). The goal of RuCoCo is to obtain a large number of annotated
texts while maintaining high inter-annotator agreement. RuCoCo contains news
texts in Russian, part of which were annotated from scratch, and for the rest
the machine-generated annotations were refined by human annotators. The size of
our corpus is one million words and around 150,000 mentions. We make the corpus
publicly available.
Related papers
- WikiNER-fr-gold: A Gold-Standard NER Corpus [1.7205106391379026]
We address the quality of the WikiNER corpus, a multilingual Named Entity Recognition corpus, and provide a consolidated version of it.
We propose WikiNER-fr-gold, which is a revised version of the French portion of WikiNER.
We present an analysis of the errors and inconsistencies observed in the WikiNER-fr corpus, and we discuss potential future work directions.
arXiv Detail & Related papers (2024-10-29T08:00:16Z) - The Russian Legislative Corpus [0.0]
The corpus collects all 281,413 texts (176,523,268 tokens) of non-secret federal regulations and acts, along with their metadata.
The corpus has two versions: the original text with minimal preprocessing, and a version prepared for linguistic analysis with morphosyntactic markup.
arXiv Detail & Related papers (2024-06-07T11:38:12Z) - KoCoNovel: Annotated Dataset of Character Coreference in Korean Novels [0.0]
KoCoNovel is a novel character coreference dataset derived from Korean literary texts.
One of KoCoNovel's distinctive features is that 24% of all character mentions are single common nouns.
arXiv Detail & Related papers (2024-04-01T14:36:35Z) - Longtonotes: OntoNotes with Longer Coreference Chains [111.73115731999793]
We build a corpus of coreference-annotated documents of significantly longer length than what is currently available.
The resulting corpus, which we call LongtoNotes, contains documents in multiple genres of the English language with varying lengths.
We evaluate state-of-the-art neural coreference systems on this new corpus.
arXiv Detail & Related papers (2022-10-07T15:58:41Z) - A Part-of-Speech Tagger for Yiddish [4.57670708264108]
This is the first step in a larger project of automatically assigning part-of-speech tags and syntactic structure to Yiddish text.
We combine two resources for the current work: an 80K-word subset of the Penn Parsed Corpus of Historical Yiddish (PPCHY) and 650 million words of OCR'd Yiddish text from the Yiddish Book Center (YBC).
We present some evidence that even simple non-contextualized embeddings trained on YBC are able to capture the relationships among spelling variants without the need to first "standardize" the corpus.
arXiv Detail & Related papers (2022-04-03T22:53:36Z) - Russian SuperGLUE 1.1: Revising the Lessons not Learned by Russian NLP models [53.95094814056337]
This paper presents Russian SuperGLUE 1.1, an updated benchmark styled after GLUE for Russian NLP models.
The new version includes a number of technical, user experience and methodological improvements.
We provide the integration of Russian SuperGLUE with a framework for industrial evaluation of the open-source models, MOROCCO.
arXiv Detail & Related papers (2022-02-15T23:45:30Z) - A Novel Corpus of Discourse Structure in Humans and Computers [55.74664144248097]
We present a novel corpus of 445 human- and computer-generated documents, comprising about 27,000 clauses.
The corpus covers both formal and informal discourse, and contains documents generated using fine-tuned GPT-2.
arXiv Detail & Related papers (2021-11-10T20:56:08Z) - An analysis of full-size Russian complexly NER labelled corpus of Internet user reviews on the drugs based on deep learning and language neural nets [94.37521840642141]
We present the full-size Russian complexly NER-labeled corpus of Internet user reviews.
A set of advanced deep learning neural networks is used to extract pharmacologically meaningful entities from Russian texts.
arXiv Detail & Related papers (2021-04-30T19:46:24Z) - CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified, with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)