Razmecheno: Named Entity Recognition from Digital Archive of Diaries "Prozhito"
- URL: http://arxiv.org/abs/2201.09997v1
- Date: Mon, 24 Jan 2022 23:06:01 GMT
- Title: Razmecheno: Named Entity Recognition from Digital Archive of Diaries "Prozhito"
- Authors: Timofey Atnashev, Veronika Ganeeva, Roman Kazakov, Daria Matyash,
Michael Sonkin, Ekaterina Voloshina, Oleg Serikov, Ekaterina Artemova
- Abstract summary: This paper aims to create a novel dataset "Razmecheno", gathered from the diary texts of the project "Prozhito" in Russian.
Razmecheno comprises 1331 sentences and 14119 tokens, sampled from diaries written during the Perestroika era.
- Score: 1.4823641127537543
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The vast majority of existing datasets for Named Entity Recognition (NER) are
built primarily on news, research papers and Wikipedia, with a few exceptions
created from historical and literary texts. Moreover, English is the main
source of data for further labelling. This paper aims to fill multiple gaps
by creating a novel dataset, "Razmecheno", gathered from the diary texts of the
project "Prozhito" in Russian. Our dataset is of interest for multiple research
directions: literary studies of diary texts, transfer learning from other domains,
and low-resource or cross-lingual named entity recognition. Razmecheno comprises
1331 sentences and 14119 tokens, sampled from diaries written during the
Perestroika era. The annotation schema consists of five commonly used entity tags:
person, characteristics, location, organisation, and facility. The labelling is
carried out on the crowdsourcing platform Yandex.Toloka in two stages. First,
workers selected sentences that contain an entity of a particular type. Second,
they marked up entity spans. As a result, 1113 entities were obtained. Empirical
evaluation of Razmecheno is carried out with off-the-shelf NER tools and by
fine-tuning pre-trained contextualized encoders. We release the annotated
dataset for open access.
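The abstract's evaluation setup (fine-tuning pre-trained contextualized encoders over the five entity tags) can be illustrated with a minimal sketch. The code below assumes a Hugging Face transformers workflow, a BIO tagging of the five types with assumed short codes, the DeepPavlov/rubert-base-cased Russian encoder, and hypothetical train_ds/dev_ds splits; none of these specifics are prescribed by the paper.

```python
# Minimal sketch (not the authors' exact setup): fine-tuning a pre-trained
# contextualized encoder for token classification over the five Razmecheno tags.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# BIO scheme over the entity types from the abstract; the short codes are assumed.
entity_types = ["PER", "CHAR", "LOC", "ORG", "FAC"]  # person, characteristics, location, organisation, facility
labels = ["O"] + [f"{prefix}-{t}" for t in entity_types for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

# Assumed Russian encoder; any BERT-like checkpoint could be substituted.
model_name = "DeepPavlov/rubert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels), id2label=id2label, label2id=label2id)

# train_ds / dev_ds are hypothetical tokenized splits with word-aligned label ids;
# Razmecheno is released openly, but no specific loading path is assumed here.
args = TrainingArguments(output_dir="razmecheno-ner",
                         num_train_epochs=3,
                         per_device_train_batch_size=16,
                         learning_rate=5e-5)
# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=dev_ds)
# trainer.train()
```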
Related papers
- Nuremberg Letterbooks: A Multi-Transcriptional Dataset of Early 15th Century Manuscripts for Document Analysis [4.660229623034816]
The Nuremberg Letterbooks dataset comprises historical documents from the early 15th century.
The dataset includes 4 books containing 1711 labeled pages written by 10 scribes.
arXiv Detail & Related papers (2024-11-11T17:08:40Z)
- BookWorm: A Dataset for Character Description and Analysis [59.186325346763184]
We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation.
We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and analyses.
Our findings show that retrieval-based approaches outperform hierarchical ones in both tasks.
arXiv Detail & Related papers (2024-10-14T10:55:58Z) - A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts [8.405938712823563]
This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents.
The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian.
This is the first study to apply large language models (LLMs) to this dataset, which is sourced from prominent literary periodicals of the era.
arXiv Detail & Related papers (2024-07-21T12:14:45Z)
- MuLMS-AZ: An Argumentative Zoning Dataset for the Materials Science Domain [1.209268134212644]
Argumentative Zone (AZ) classification has been proposed to improve the processing of scholarly documents.
We present and release a new dataset of 50 manually annotated research articles.
arXiv Detail & Related papers (2023-07-05T14:55:18Z)
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
- MobIE: A German Dataset for Named Entity Recognition, Entity Linking and Relation Extraction in the Mobility Domain [76.21775236904185]
The dataset consists of 3,232 social media texts and traffic reports with 91K tokens and contains 20.5K annotated entities.
A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types.
To the best of our knowledge, this is the first German-language dataset that combines annotations for NER, EL and RE.
arXiv Detail & Related papers (2021-08-16T08:21:50Z)
- Documenting the English Colossal Clean Crawled Corpus [28.008953329187648]
This work provides the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl.
We begin with a high-level summary of the data, including distributions of where the text came from and when it was written.
We then give more detailed analysis on salient parts of this data, including the most frequent sources of text.
arXiv Detail & Related papers (2021-04-18T07:42:52Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.