From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early
Modern French
- URL: http://arxiv.org/abs/2202.09452v1
- Date: Fri, 18 Feb 2022 22:17:22 GMT
- Title: From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early
Modern French
- Authors: Simon Gabay, Pedro Ortiz Suarez, Alexandre Bartz, Alix Chagué,
Rachel Bawden, Philippe Gambette, Benoît Sagot
- Abstract summary: We present our efforts to develop NLP tools for Early Modern French (historical French from the 16$^\text{th}$ to the 18$^\text{th}$ centuries).
We present the $\text{FreEM}_{\text{max}}$ corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on $\text{FreEM}_{\text{max}}$.
- Score: 57.886210204774834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models for historical states of language are becoming increasingly
important to allow the optimal digitisation and analysis of old textual
sources. Because these historical states are at the same time more complex to
process and more scarce in the corpora available, specific efforts are
necessary to train natural language processing (NLP) tools adapted to the data.
In this paper, we present our efforts to develop NLP tools for Early Modern
French (historical French from the 16$^\text{th}$ to the 18$^\text{th}$
centuries). We present the $\text{FreEM}_{\text{max}}$ corpus of Early Modern
French and D'AlemBERT, a RoBERTa-based language model trained on
$\text{FreEM}_{\text{max}}$. We evaluate the usefulness of D'AlemBERT by
fine-tuning it on a part-of-speech tagging task, outperforming previous work on
the test set. Importantly, we find evidence for the transfer learning capacity
of the language model, since its performance on lesser-resourced time periods
appears to have been boosted by the more resourced ones. We release D'AlemBERT
and the open-sourced subpart of the $\text{FreEM}_{\text{max}}$ corpus.
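The part-of-speech fine-tuning described in the abstract corresponds to a standard token-classification setup. The sketch below shows one way such a fine-tuning could be run with the Hugging Face transformers library; it is a minimal illustration, not the paper's actual configuration: the hub identifier "pjox/dalembert", the reduced tag set, and the toy Early Modern French sentence are all placeholder assumptions.

```python
# Minimal sketch: fine-tuning a RoBERTa-style checkpoint for POS tagging
# with Hugging Face Transformers. Checkpoint name, tag set and data below
# are illustrative placeholders, not the paper's setup.
import torch
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer)

MODEL_NAME = "pjox/dalembert"  # assumed hub id; replace with the released checkpoint
POS_TAGS = ["ADJ", "ADP", "DET", "NOUN", "PUNCT", "VERB"]  # reduced, illustrative tag set
tag2id = {t: i for i, t in enumerate(POS_TAGS)}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME,
                                                        num_labels=len(POS_TAGS))

# Toy pre-tokenised sentence with word-level tags.
words = [["Les", "estoilles", "brillent", "."]]
tags = [["DET", "NOUN", "VERB", "PUNCT"]]

def encode(words, tags):
    enc = tokenizer(words, is_split_into_words=True, truncation=True,
                    padding=True, return_tensors="pt")
    labels = []
    for i, sent_tags in enumerate(tags):
        word_ids = enc.word_ids(batch_index=i)
        lab, prev = [], None
        for wid in word_ids:
            # Label only the first sub-token of each word; mask the rest with -100.
            lab.append(-100 if wid is None or wid == prev else tag2id[sent_tags[wid]])
            prev = wid
        labels.append(lab)
    enc["labels"] = torch.tensor(labels)
    return enc

class ToyPOSDataset(torch.utils.data.Dataset):
    def __init__(self, enc):
        self.enc = enc
    def __len__(self):
        return self.enc["input_ids"].size(0)
    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.enc.items()}

train_ds = ToyPOSDataset(encode(words, tags))
args = TrainingArguments(output_dir="dalembert-pos", num_train_epochs=3,
                         per_device_train_batch_size=8, learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```

In practice the training data would come from the annotated FreEM sources rather than a toy list, and the tag set would match the corpus annotation scheme.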
Related papers
- Skywork: A More Open Bilingual Foundation Model [55.927396986873816]
We present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts.
We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling on diverse domains.
arXiv Detail & Related papers (2023-10-30T08:31:47Z) - Textually Pretrained Speech Language Models [107.10344535390956]
We propose TWIST, a method for training SpeechLMs using a warm start from a pretrained textual language model.
We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board.
arXiv Detail & Related papers (2023-05-22T13:12:16Z) - FreCDo: A Large Corpus for French Cross-Domain Dialect Identification [22.132457694021184]
We present a novel corpus for French dialect identification comprising 413,522 French text samples.
The training, validation and test splits are collected from different news websites.
This leads to a French cross-domain (FreCDo) dialect identification task.
arXiv Detail & Related papers (2022-12-15T10:32:29Z) - LongFNT: Long-form Speech Recognition with Factorized Neural Transducer [64.75547712366784]
We propose the LongFNT-Text architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor.
The effectiveness of our LongFNT approach is validated on the LibriSpeech and GigaSpeech corpora with 19% and 12% relative word error rate (WER) reduction, respectively.
arXiv Detail & Related papers (2022-11-17T08:48:27Z) - Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z) - hmBERT: Historical Multilingual Language Models for Named Entity
Recognition [0.6226609932118123]
We tackle NER for identifying persons, locations, and organizations in historical texts.
In this work, we address historical German, English, French, Swedish, and Finnish by training large historical language models.
arXiv Detail & Related papers (2022-05-31T07:30:33Z) - Pre-training Data Quality and Quantity for a Low-Resource Language: New
Corpus and BERT Models for Maltese [4.4681678689625715]
We analyse the effect of pre-training with monolingual data for a low-resource language.
We present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance.
We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu).
arXiv Detail & Related papers (2022-05-21T06:44:59Z) - TunBERT: Pretrained Contextualized Text Representation for Tunisian
Dialect [0.0]
We investigate the feasibility of training monolingual Transformer-based language models for under-represented languages.
We show that the use of noisy web-crawled data instead of structured data is more convenient for such a non-standardized language.
Our best performing TunBERT model reaches or improves the state-of-the-art in all three downstream tasks.
arXiv Detail & Related papers (2021-11-25T15:49:50Z) - Are Multilingual Models the Best Choice for Moderately Under-resourced
Languages? A Comprehensive Assessment for Catalan [0.05277024349608833]
This work focuses on Catalan with the aim of exploring to what extent a medium-sized monolingual language model is competitive with state-of-the-art large multilingual models.
We build a clean, high-quality textual Catalan corpus (CaText), train a Transformer-based language model for Catalan (BERTa), and devise a thorough evaluation in a diversity of settings.
The result is a new benchmark, the Catalan Language Understanding Benchmark (CLUB), which we publish as an open resource.
arXiv Detail & Related papers (2021-07-16T13:52:01Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)