From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early
Modern French
- URL: http://arxiv.org/abs/2202.09452v1
- Date: Fri, 18 Feb 2022 22:17:22 GMT
- Title: From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early
Modern French
- Authors: Simon Gabay, Pedro Ortiz Suarez, Alexandre Bartz, Alix Chagué,
Rachel Bawden, Philippe Gambette, Benoît Sagot
- Abstract summary: We present our efforts to develop NLP tools for Early Modern French (historical French from the 16$^\text{th}$ to the 18$^\text{th}$ centuries).
We present the $\text{FreEM}_{\text{max}}$ corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on $\text{FreEM}_{\text{max}}$.
- Score: 57.886210204774834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models for historical states of language are becoming increasingly
important to allow the optimal digitisation and analysis of old textual
sources. Because these historical states are at the same time more complex to
process and more scarce in the corpora available, specific efforts are
necessary to train natural language processing (NLP) tools adapted to the data.
In this paper, we present our efforts to develop NLP tools for Early Modern
French (historical French from the 16$^\text{th}$ to the 18$^\text{th}$
centuries). We present the $\text{FreEM}_{\text{max}}$ corpus of Early Modern
French and D'AlemBERT, a RoBERTa-based language model trained on
$\text{FreEM}_{\text{max}}$. We evaluate the usefulness of D'AlemBERT by
fine-tuning it on a part-of-speech tagging task, outperforming previous work on
the test set. Importantly, we find evidence for the transfer learning capacity
of the language model, since its performance on lesser-resourced time periods
appears to have been boosted by the more resourced ones. We release D'AlemBERT
and the open-sourced subpart of the $\text{FreEM}_{\text{max}}$ corpus.
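The part-of-speech fine-tuning described in the abstract corresponds to a standard token-classification setup. The sketch below shows one way such a fine-tuning could be run with the Hugging Face transformers library; it is a minimal illustration, not the paper's actual configuration: the hub identifier "pjox/dalembert", the reduced tag set, and the toy Early Modern French sentence are all placeholder assumptions.

```python
# Minimal sketch: fine-tuning a RoBERTa-style checkpoint for POS tagging
# with Hugging Face Transformers. Checkpoint name, tag set and data below
# are illustrative placeholders, not the paper's setup.
import torch
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer)

MODEL_NAME = "pjox/dalembert"  # assumed hub id; replace with the released checkpoint
POS_TAGS = ["ADJ", "ADP", "DET", "NOUN", "PUNCT", "VERB"]  # reduced, illustrative tag set
tag2id = {t: i for i, t in enumerate(POS_TAGS)}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME,
                                                        num_labels=len(POS_TAGS))

# Toy pre-tokenised sentence with word-level tags.
words = [["Les", "estoilles", "brillent", "."]]
tags = [["DET", "NOUN", "VERB", "PUNCT"]]

def encode(words, tags):
    enc = tokenizer(words, is_split_into_words=True, truncation=True,
                    padding=True, return_tensors="pt")
    labels = []
    for i, sent_tags in enumerate(tags):
        word_ids = enc.word_ids(batch_index=i)
        lab, prev = [], None
        for wid in word_ids:
            # Label only the first sub-token of each word; mask the rest with -100.
            lab.append(-100 if wid is None or wid == prev else tag2id[sent_tags[wid]])
            prev = wid
        labels.append(lab)
    enc["labels"] = torch.tensor(labels)
    return enc

class ToyPOSDataset(torch.utils.data.Dataset):
    def __init__(self, enc):
        self.enc = enc
    def __len__(self):
        return self.enc["input_ids"].size(0)
    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.enc.items()}

train_ds = ToyPOSDataset(encode(words, tags))
args = TrainingArguments(output_dir="dalembert-pos", num_train_epochs=3,
                         per_device_train_batch_size=8, learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```

In practice the training data would come from the annotated FreEM sources rather than a toy list, and the tag set would match the corpus annotation scheme.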
Related papers
- Skywork: A More Open Bilingual Foundation Model [55.927396986873816]
We present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts.
We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling on diverse domains.
arXiv Detail & Related papers (2023-10-30T08:31:47Z) - Textually Pretrained Speech Language Models [107.10344535390956]
We propose TWIST, a method for training SpeechLMs using a warm start from a pretrained textual language model.
We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board.
arXiv Detail & Related papers (2023-05-22T13:12:16Z) - FreCDo: A Large Corpus for French Cross-Domain Dialect Identification [22.132457694021184]
We present a novel corpus for French dialect identification comprising 413,522 French text samples.
The training, validation and test splits are collected from different news websites.
This leads to a French cross-domain (FreCDo) dialect identification task.
arXiv Detail & Related papers (2022-12-15T10:32:29Z) - LongFNT: Long-form Speech Recognition with Factorized Neural Transducer [64.75547712366784]
We propose the LongFNT-Text architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor.
The effectiveness of our LongFNT approach is validated on the LibriSpeech and GigaSpeech corpora with 19% and 12% relative word error rate (WER) reduction, respectively.
arXiv Detail & Related papers (2022-11-17T08:48:27Z) - Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z) - hmBERT: Historical Multilingual Language Models for Named Entity
Recognition [0.6226609932118123]
We tackle NER for identifying persons, locations, and organizations in historical texts.
In this work, we address historical German, English, French, Swedish, and Finnish by training large historical language models.
arXiv Detail & Related papers (2022-05-31T07:30:33Z) - Pre-training Data Quality and Quantity for a Low-Resource Language: New
Corpus and BERT Models for Maltese [4.4681678689625715]
We analyse the effect of pre-training with monolingual data for a low-resource language.
We present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance.
We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu).
arXiv Detail & Related papers (2022-05-21T06:44:59Z) - TunBERT: Pretrained Contextualized Text Representation for Tunisian
Dialect [0.0]
We investigate the feasibility of training monolingual Transformer-based language models for under-represented languages.
We show that the use of noisy web-crawled data instead of structured data is more convenient for such a non-standardized language.
Our best performing TunBERT model reaches or improves the state-of-the-art in all three downstream tasks.
arXiv Detail & Related papers (2021-11-25T15:49:50Z) - Are Multilingual Models the Best Choice for Moderately Under-resourced
Languages? A Comprehensive Assessment for Catalan [0.05277024349608833]
This work focuses on Catalan with the aim of exploring to what extent a medium-sized monolingual language model is competitive with state-of-the-art large multilingual models.
We build a clean, high-quality textual Catalan corpus (CaText), train a Transformer-based language model for Catalan (BERTa), and devise a thorough evaluation in a diversity of settings.
The result is a new benchmark, the Catalan Language Understanding Benchmark (CLUB), which we publish as an open resource.
arXiv Detail & Related papers (2021-07-16T13:52:01Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)