Corpus and Models for Lemmatisation and POS-tagging of Old French
- URL: http://arxiv.org/abs/2109.11442v1
- Date: Thu, 23 Sep 2021 15:32:41 GMT
- Title: Corpus and Models for Lemmatisation and POS-tagging of Old French
- Authors: Jean-Baptiste Camps, Thibault Clérice, Frédéric Duval, Lucence
Ing, Naomi Kanaoka and Ariane Pinche
- Abstract summary: We present the current results of a long-running project providing lemmatisation and POS models for Old French.
We describe how we approached the difficult question of providing lemmatisation and POS models for Old French with the help of neural taggers and the progressive constitution of dedicated corpora.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Old French is a typical example of an under-resourced historical language
that, furthermore, displays a substantial amount of linguistic variation. In this
paper, we present the current results of a long-running project (2015-...) and
describe how we approached the difficult question of providing lemmatisation
and POS models for Old French with the help of neural taggers and the
progressive constitution of dedicated corpora.
Related papers
- Evaluating Large Language Models on Controlled Generation Tasks [92.64781370921486]
We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities.
After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing where large language models fall behind, are comparable to, or exceed the ability of smaller models.
arXiv Detail & Related papers (2023-10-23T03:48:24Z)
- Masked Part-Of-Speech Model: Does Modeling Long Context Help Unsupervised POS-tagging? [94.68962249604749]
We propose a Masked Part-of-Speech Model (MPoSM) to facilitate flexible dependency modeling.
MPoSM can model arbitrary tag dependency and perform POS induction through the objective of masked POS reconstruction.
We achieve competitive results on both the English Penn WSJ dataset and the universal treebank containing 10 diverse languages.
arXiv Detail & Related papers (2022-06-30T01:43:05Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French [57.886210204774834]
We present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to the 18th centuries).
We present the $\text{FreEM}_{\text{max}}$ corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on $\text{FreEM}_{\text{max}}$.
arXiv Detail & Related papers (2022-02-18T22:17:22Z)
- Cedille: A large autoregressive French language model [0.21756081703276003]
We introduce Cedille, a large open source auto-regressive language model, specifically trained for the French language.
Our results show that Cedille outperforms existing French language models and is competitive with GPT-3 on a range of French zero-shot benchmarks.
arXiv Detail & Related papers (2022-02-07T17:40:43Z)
- PAGnol: An Extra-Large French Generative Model [53.40189314359048]
We introduce PAGnol, a collection of French GPT models.
Using scaling laws, we efficiently train PAGnol-XL with the same computational budget as CamemBERT.
arXiv Detail & Related papers (2021-10-16T11:44:23Z)
- On Language Models for Creoles [8.577162764242845]
Creole languages such as Nigerian Pidgin English and Haitian Creole are under-resourced and largely ignored in the NLP literature.
The transfer of grammatical and lexical features to the creole is a complex process.
While creoles are generally stable, the prominence of some features may be much stronger with certain demographics or in some linguistic situations.
arXiv Detail & Related papers (2021-09-13T15:51:15Z)
- Standardizing linguistic data: method and tools for annotating (pre-orthographic) French [0.0]
In the present paper, we describe both methodologically (by proposing annotation principles) and technically (by creating the required training data and the relevant models) the production of a linguistic tagger for (early) modern French (16th-18th c.).
We take existing standards for contemporary and, especially, medieval French into account as far as possible.
arXiv Detail & Related papers (2020-11-22T17:39:43Z)
- Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre [0.0]
This paper describes the process of building an annotated corpus and training models for classical French literature.
It was originally developed as a preliminary step to the stylometric analyses presented in Cafiero and Camps.
The use of a recent lemmatiser based on neural networks and a CRF tagger makes it possible to achieve accuracies beyond the current state of the art on the in-domain test.
arXiv Detail & Related papers (2020-05-15T12:47:54Z)
- A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)