Related papers: The Russian Legislative Corpus

URL: http://arxiv.org/abs/2406.04855v2
Date: Mon, 28 Oct 2024 12:07:49 GMT
Title: The Russian Legislative Corpus
Authors: Denis Saveliev, Ruslan Kuchakov
Abstract summary: The corpus collects all 281,413 texts (176,523,268 tokens) of non-secret federal regulations and acts, along with their metadata. The corpus has two versions: the original text with minimal preprocessing and a version prepared for linguistic analysis with morphosyntactic markup.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We present the comprehensive Russian primary and secondary legislation corpus covering 1991 to 2023. The corpus collects all 281,413 texts (176,523,268 tokens) of non-secret federal regulations and acts, along with their metadata. The corpus has two versions: the original text with minimal preprocessing and a version prepared for linguistic analysis with morphosyntactic markup.

Related papers

SinhaLegal: A Benchmark Corpus for Information Extraction and Analysis in Sinhala Legislative Texts [0.0]
SinhaLegal introduces a Sinhala legislative text corpus containing approximately 2 million words across 1,206 legal documents. The dataset includes two types of legal documents: 1,065 Acts dated from 1981 to 2014 and 141 Bills from 2010 to 2014. The texts were extracted using OCR with Google Document AI, followed by extensive post-processing and manual cleaning to ensure high-quality, machine-readable content.
arXiv Detail & Related papers (2026-03-05T06:13:44Z)
Targum -- A Multilingual New Testament Translation Corpus [46.390064640459]
We introduce a multilingual corpus of 657 New Testament translations, of which 352 are unique, with unprecedented depth in five languages: English (208 unique versions from 396 total), French (41 from 78), Italian (18 from 33), Polish (30 from 48), and Spanish (55 from 102). Each translation is manually annotated with metadata that maps the text to a standardized identifier for the work, its specific edition, and its year of revision. This canonicalization empowers researchers to define "uniqueness" for their own needs.
arXiv Detail & Related papers (2026-02-10T12:27:57Z)
Multilingual and Explainable Text Detoxification with Parallel Corpora [58.83211571400692]
We extend the parallel text detoxification corpus to new languages. We conduct a first-of-its-kind automated, explainable analysis of the descriptive features of both toxic and non-toxic sentences. We then experiment with a novel text detoxification method inspired by the Chain-of-Thoughts reasoning approach.
arXiv Detail & Related papers (2024-12-16T12:08:59Z)
The SAMER Arabic Text Simplification Corpus [9.369209124775043]
SAMER Corpus is the first manually annotated Arabic parallel corpus for text simplification targeting school-aged learners. Our corpus comprises texts of 159K words selected from 15 publicly available Arabic fiction novels published between 1865 and 1955. Our corpus includes readability level annotations at both the document and word levels, as well as two simplified parallel versions for each text targeting learners at two different readability levels.
arXiv Detail & Related papers (2024-04-29T11:34:06Z)
Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification [77.45995868988301]
Text detoxification is the task of transferring the style of text from toxic to neutral. We present a large-scale study of strategies for cross-lingual text detoxification.
arXiv Detail & Related papers (2023-11-23T11:40:28Z)
A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics. Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
RuCoCo: a new Russian corpus with coreference annotation [69.3939291118954]
We present a new corpus with coreference annotation, the Russian Coreference Corpus (RuCoCo). RuCoCo contains news texts in Russian, part of which were annotated from scratch, and for the rest the machine-generated annotations were refined by human annotators. The size of our corpus is one million words and around 150,000 mentions.
arXiv Detail & Related papers (2022-06-10T07:50:09Z)
The Open corpus of the Veps and Karelian languages: overview and applications [52.77024349608834]
The Open Corpus of the Veps and Karelian Languages (VepKar) is an extension of the Veps corpus created in 2009. The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced search system. Future plans include developing a speech module for working with audio recordings and a syntactic tagging module using morphological analysis outputs.
arXiv Detail & Related papers (2022-06-08T13:05:50Z)
A Novel Corpus of Discourse Structure in Humans and Computers [55.74664144248097]
We present a novel corpus of 445 human- and computer-generated documents, comprising about 27,000 clauses. The corpus covers both formal and informal discourse, and contains documents generated using fine-tuned GPT-2.
arXiv Detail & Related papers (2021-11-10T20:56:08Z)
Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus [0.04915744683251149]
Amharic corpus is partly a web corpus. Texts are collected from 25,199 documents from different domains. About 24 million orthographic words are tokenized.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
An analysis of full-size Russian complexly NER labelled corpus of Internet user reviews on the drugs based on deep learning and language neural nets [94.37521840642141]
We present the full-size Russian complexly NER-labeled corpus of Internet user reviews. A set of advanced deep learning neural networks is used to extract pharmacologically meaningful entities from Russian texts.
arXiv Detail & Related papers (2021-04-30T19:46:24Z)
HELFI: a Hebrew-Greek-Finnish Parallel Bible Corpus with Cross-Lingual Morpheme Alignment [0.0]
Twenty-five years ago, morphologically aligned Hebrew-Finnish and Greek-Finnish bitexts were constructed manually. This paper describes a nontrivial editorial process starting from the creation of the original one-purpose database. It ends with its reconstruction using only freely available text editions and annotations.
arXiv Detail & Related papers (2020-03-16T22:10:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.
