The Open corpus of the Veps and Karelian languages: overview and
applications
- URL: http://arxiv.org/abs/2206.03870v1
- Date: Wed, 8 Jun 2022 13:05:50 GMT
- Title: The Open corpus of the Veps and Karelian languages: overview and
applications
- Authors: Tatyana Boyko, Nina Zaitseva, Natalia Krizhanovskaya, Andrew
Krizhanovsky, Irina Novak, Nataliya Pellinen and Aleksandra Rodionova
- Abstract summary: The Open Corpus of the Veps and Karelian Languages (VepKar) is an extension of the Veps Corpus created in 2009.
The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced system of search.
Future plans include developing a speech module for working with audio recordings and a syntactic tagging module using morphological analysis outputs.
- Score: 52.77024349608834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A growing priority in the study of Baltic-Finnic languages of the Republic of
Karelia has been the methods and tools of corpus linguistics. Since 2016,
linguists, mathematicians, and programmers at the Karelian Research Centre have
been working with the Open Corpus of the Veps and Karelian Languages (VepKar),
which is an extension of the Veps Corpus created in 2009. The VepKar corpus
comprises texts in Karelian and Veps, multifunctional dictionaries linked to
them, and software with an advanced system of search using various criteria of
the texts (language, genre, etc.) and numerous linguistic categories (lexical
and grammatical search in texts was implemented thanks to the generator of word
forms that we created earlier). A corpus of 3000 texts was compiled, texts were
uploaded and marked up, the system for classifying texts into languages,
dialects, types and genres was introduced, and the word-form generator was
created. Future plans include developing a speech module for working with audio
recordings and a syntactic tagging module using morphological analysis outputs.
Owing to continuous functional advancements in the corpus manager and ongoing
VepKar enrichment with new material and text markup, users can handle a wide
range of scientific and applied tasks. In creating the universal national
VepKar corpus, its developers and managers strive to preserve and exhibit as
fully as possible the state of the Veps and Karelian languages in the 19th-21st
centuries.
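The abstract says that lexical and grammatical search was made possible by a word-form generator, and that texts are classified by language and genre. A minimal sketch of how such a search could work is given below. It is not the VepKar implementation: the names `Text`, `TOY_ENDINGS`, `generate_forms` and `search` are hypothetical, and the stem and endings are invented placeholders, not real Veps or Karelian morphology.

```python
from dataclasses import dataclass

# Hypothetical toy paradigm. These endings are placeholders chosen for
# illustration only; they are not real Veps or Karelian morphology.
TOY_ENDINGS = {
    "nominative singular": "",
    "genitive singular": "n",
    "nominative plural": "t",
}


@dataclass(frozen=True)
class Text:
    """A corpus text with the metadata used as search criteria."""
    language: str
    genre: str
    tokens: tuple


def generate_forms(stem):
    """Return a mapping from each generated word form to its grammatical label."""
    return {stem + ending: label for label, ending in TOY_ENDINGS.items()}


def search(texts, stem, language=None, genre=None, gram=None):
    """Find tokens that are generated forms of `stem`.

    `language` and `genre` filter texts by metadata; `gram` keeps only
    hits with the given grammatical label. Each hit is a tuple
    (text index, token position, token, grammatical label).
    """
    forms = generate_forms(stem)
    hits = []
    for text_index, text in enumerate(texts):
        if language is not None and text.language != language:
            continue
        if genre is not None and text.genre != genre:
            continue
        for position, token in enumerate(text.tokens):
            label = forms.get(token.lower())
            if label is not None and (gram is None or label == gram):
                hits.append((text_index, position, token, label))
    return hits
```

The design point is that the generator is run once per query to build a form-to-label table, so a lemma query also finds inflected tokens, and a grammatical query is a filter on the labels of that table.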
Related papers
- ILiAD: An Interactive Corpus for Linguistic Annotated Data from Twitter Posts [0.0]
We present the development and deployment of a linguistic corpus from Twitter posts in English.
The main goal was to create a fully annotated English corpus for linguistic analysis.
We include information on morphology and syntax, as well as NLP features such as tokenization, lemmas, and n-grams.
arXiv Detail & Related papers (2024-07-22T04:48:04Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- LiMe: a Latin Corpus of Late Medieval Criminal Sentences [39.26357402982764]
We present the LiMe dataset, a corpus of 325 documents extracted from a series of medieval manuscripts called Libri sententiarum potestatis Mediolani.
arXiv Detail & Related papers (2024-04-19T12:06:28Z)
- Carolina: a General Corpus of Contemporary Brazilian Portuguese with Provenance, Typology and Versioning Information [0.629199190108771]
Carolina is a large open corpus of Brazilian Portuguese texts under construction using web-as-corpus methodology.
Carolina's first public version has 653,322,577 tokens, distributed over 7 broad types.
arXiv Detail & Related papers (2023-03-28T16:09:40Z)
- Creating a morphological and syntactic tagged corpus for the Uzbek language [0.0]
We develop a novel Part Of Speech (POS) and syntactic tagset for creating a syntactically and morphologically tagged corpus of the Uzbek language.
Based on the developed annotation tool and software, we share our experience and results from the first stage of tagged corpus creation.
arXiv Detail & Related papers (2022-10-27T07:44:12Z)
- Lahjoita puhetta -- a large-scale corpus of spoken Finnish with some benchmarks [9.160401226886947]
The Donate Speech campaign has so far succeeded in gathering approximately 3600 hours of ordinary, colloquial Finnish speech.
The primary goals of the collection were to create a representative, large-scale resource to study spontaneous spoken Finnish and to accelerate the development of language technology and speech-based services.
We present the collection process and the collected corpus, and showcase its versatility through multiple use cases.
arXiv Detail & Related papers (2022-03-24T07:50:25Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
- A Corpus for Large-Scale Phonetic Typology [112.19288631037055]
We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology, with aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.
arXiv Detail & Related papers (2020-05-28T13:03:51Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code--text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.