The Russian Legislative Corpus
- URL: http://arxiv.org/abs/2406.04855v2
- Date: Mon, 28 Oct 2024 12:07:49 GMT
- Title: The Russian Legislative Corpus
- Authors: Denis Saveliev, Ruslan Kuchakov
- Abstract summary: The corpus collects all 281,413 texts (176,523,268 tokens) of non-secret federal regulations and acts, along with their metadata.
The corpus has two versions: the original text with minimal preprocessing, and a version prepared for linguistic analysis with morphosyntactic markup.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present a comprehensive corpus of Russian primary and secondary legislation covering 1991 to 2023. The corpus collects all 281,413 texts (176,523,268 tokens) of non-secret federal regulations and acts, along with their metadata. The corpus has two versions: the original text with minimal preprocessing, and a version prepared for linguistic analysis with morphosyntactic markup.
Related papers
- Multilingual and Explainable Text Detoxification with Parallel Corpora [58.83211571400692]
We extend the parallel text detoxification corpus to new languages.
We conduct a first-of-its-kind automated, explainable analysis of the descriptive features of both toxic and non-toxic sentences.
We then experiment with a novel text detoxification method inspired by the Chain-of-Thoughts reasoning approach.
arXiv Detail & Related papers (2024-12-16T12:08:59Z)
- The SAMER Arabic Text Simplification Corpus [9.369209124775043]
SAMER Corpus is the first manually annotated Arabic parallel corpus for text simplification targeting school-aged learners.
Our corpus comprises texts of 159K words selected from 15 publicly available Arabic fiction novels published between 1865 and 1955.
Our corpus includes readability level annotations at both the document and word levels, as well as two simplified parallel versions for each text targeting learners at two different readability levels.
arXiv Detail & Related papers (2024-04-29T11:34:06Z)
- Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification [77.45995868988301]
Text detoxification is the task of transferring the style of text from toxic to neutral.
We present a large-scale study of strategies for cross-lingual text detoxification.
arXiv Detail & Related papers (2023-11-23T11:40:28Z)
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
- RuCoCo: a new Russian corpus with coreference annotation [69.3939291118954]
We present a new corpus with coreference annotation, the Russian Coreference Corpus (RuCoCo).
RuCoCo contains news texts in Russian; part of them were annotated from scratch, and for the rest, machine-generated annotations were refined by human annotators.
The size of our corpus is one million words and around 150,000 mentions.
arXiv Detail & Related papers (2022-06-10T07:50:09Z)
- The Open corpus of the Veps and Karelian languages: overview and applications [52.77024349608834]
The Open Corpus of the Veps and Karelian Languages (VepKar) is an extension of the Veps Corpus created in 2009.
The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced system of search.
Future plans include developing a speech module for working with audio recordings and a syntactic tagging module using morphological analysis outputs.
arXiv Detail & Related papers (2022-06-08T13:05:50Z)
- A Novel Corpus of Discourse Structure in Humans and Computers [55.74664144248097]
We present a novel corpus of 445 human- and computer-generated documents, comprising about 27,000 clauses.
The corpus covers both formal and informal discourse, and contains documents generated using fine-tuned GPT-2.
arXiv Detail & Related papers (2021-11-10T20:56:08Z)
- Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus [0.04915744683251149]
The Amharic corpus is partly a web corpus.
Texts are collected from 25,199 documents from different domains.
About 24 million orthographic words are tokenized.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
- An analysis of full-size Russian complexly NER labelled corpus of Internet user reviews on the drugs based on deep learning and language neural nets [94.37521840642141]
We present the full-size Russian complexly NER-labeled corpus of Internet user reviews.
A set of advanced deep learning neural networks is used to extract pharmacologically meaningful entities from Russian texts.
arXiv Detail & Related papers (2021-04-30T19:46:24Z)
- HELFI: a Hebrew-Greek-Finnish Parallel Bible Corpus with Cross-Lingual Morpheme Alignment [0.0]
Twenty-five years ago, morphologically aligned Hebrew-Finnish and Greek-Finnish bitexts were constructed manually.
This paper describes a nontrivial editorial process starting from the creation of the original one-purpose database.
It ends with its reconstruction using only freely available text editions and annotations.
arXiv Detail & Related papers (2020-03-16T22:10:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.