Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged
Amharic Corpus
- URL: http://arxiv.org/abs/2106.07241v1
- Date: Mon, 14 Jun 2021 08:49:52 GMT
- Title: Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged
Amharic Corpus
- Authors: Andargachew Mekonnen Gezmu, Binyam Ephrem Seyoum, Michael Gasser and
Andreas Nürnberger
- Abstract summary: The Amharic corpus is partly a web corpus.
Texts are collected from 25,199 documents from different domains.
About 24 million orthographic words are tokenized.
- Score: 0.04915744683251149
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduced the contemporary Amharic corpus, which is automatically tagged
for morpho-syntactic information. Texts are collected from 25,199 documents
from different domains and about 24 million orthographic words are tokenized.
Since it is partly a web corpus, we made some automatic spelling error
correction. We have also modified the existing morphological analyzer,
HornMorpho, to use it for the automatic tagging.
Related papers
- QuranMorph: Morphologically Annotated Quranic Corpus [0.0]
QuranMorph is a morphologically annotated corpus for the Quran. The lemmatization process utilized lemmas from Qabas, an Arabic lexicographic database. The part-of-speech tagging was performed using the fine-grained SAMA/Qabas tagset.
arXiv Detail & Related papers (2025-06-22T19:34:09Z)
- WikiNER-fr-gold: A Gold-Standard NER Corpus [1.7205106391379026]
We address the quality of the WikiNER corpus, a multilingual Named Entity Recognition corpus, and provide a consolidated version of it.
We propose WikiNER-fr-gold, which is a revised version of the French portion of WikiNER.
We present an analysis of errors and inconsistency observed in the WikiNER-fr corpus, and we discuss potential future work directions.
arXiv Detail & Related papers (2024-10-29T08:00:16Z)
- The Russian Legislative Corpus [0.0]
The corpus collects all 281,413 texts (176,523,268 tokens) of non-secret federal regulations and acts, along with their metadata.
The corpus has two versions: the original text with minimal preprocessing, and a version prepared for linguistic analysis with morphosyntactic markup.
arXiv Detail & Related papers (2024-06-07T11:38:12Z)
- Understanding the effects of word-level linguistic annotations in under-resourced neural machine translation [0.0]
This paper studies the effects of word-level linguistic annotations in under-resourced neural machine translation.
Part-of-speech tags systematically outperform morpho-syntactic description tags in terms of automatic evaluation metrics.
arXiv Detail & Related papers (2024-01-29T11:39:46Z)
- Nonparametric Masked Language Modeling [113.71921977520864]
Existing language models (LMs) predict tokens with a softmax over a finite vocabulary.
We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus.
NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval.
arXiv Detail & Related papers (2022-12-02T18:10:42Z)
- The Open corpus of the Veps and Karelian languages: overview and applications [52.77024349608834]
The Open Corpus of the Veps and Karelian Languages (VepKar) is an extension of the Veps corpus created in 2009.
The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced system of search.
Future plans include developing a speech module for working with audio recordings and a syntactic tagging module using morphological analysis outputs.
arXiv Detail & Related papers (2022-06-08T13:05:50Z)
- Quantifying Synthesis and Fusion and their Impact on Machine Translation [79.61874492642691]
Literature in Natural Language Processing (NLP) typically labels a whole language with a strict type of morphology, e.g. fusional or agglutinative.
In this work, we propose to reduce the rigidity of such claims, by quantifying morphological typology at the word and segment level.
For computing synthesis, we test unsupervised and supervised morphological segmentation methods for English, German and Turkish, whereas for fusion, we propose a semi-automatic method using Spanish as a case study.
Then, we analyse the relationship between machine translation quality and the degree of synthesis and fusion at word (nouns and verbs for English-Turkish,
arXiv Detail & Related papers (2022-05-06T17:04:58Z)
- Lahjoita puhetta -- a large-scale corpus of spoken Finnish with some benchmarks [9.160401226886947]
The Donate Speech campaign has so far succeeded in gathering approximately 3600 hours of ordinary, colloquial Finnish speech.
The primary goals of the collection were to create a representative, large-scale resource to study spontaneous spoken Finnish and to accelerate the development of language technology and speech-based services.
We present the collection process and the collected corpus, and showcase its versatility through multiple use cases.
arXiv Detail & Related papers (2022-03-24T07:50:25Z)
- A Novel Corpus of Discourse Structure in Humans and Computers [55.74664144248097]
We present a novel corpus of 445 human- and computer-generated documents, comprising about 27,000 clauses.
The corpus covers both formal and informal discourse, and contains documents generated using fine-tuned GPT-2.
arXiv Detail & Related papers (2021-11-10T20:56:08Z)
- An analysis of full-size Russian complexly NER labelled corpus of Internet user reviews on the drugs based on deep learning and language neural nets [94.37521840642141]
We present the full-size Russian complexly NER-labeled corpus of Internet user reviews.
A set of advanced deep learning neural networks is used to extract pharmacologically meaningful entities from Russian texts.
arXiv Detail & Related papers (2021-04-30T19:46:24Z)
- Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)
- Validation and Normalization of DCS corpus using Sanskrit Heritage tools to build a tagged Gold Corpus [0.0]
The Digital Corpus of Sanskrit records around 650,000 sentences along with their morphological and lexical tagging.
The Sanskrit Heritage Engine's Reader produces all possible segmentations with morphological and lexical analyses.
arXiv Detail & Related papers (2020-05-13T19:23:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.