Standardizing linguistic data: method and tools for annotating
(pre-orthographic) French
- URL: http://arxiv.org/abs/2011.11074v1
- Date: Sun, 22 Nov 2020 17:39:43 GMT
- Title: Standardizing linguistic data: method and tools for annotating
(pre-orthographic) French
- Authors: Simon Gabay (UNIGE), Thibault Clérice (ENC), Jean-Baptiste Camps (ENC), Jean-Baptiste Tanguy (SU), Matthias Gille-Levenson (ENS Lyon)
- Abstract summary: In the present paper, we describe both methodologically (by proposing annotation principles) and technically (by creating the required training data and the relevant models) the production of a linguistic tagger for (early) modern French (16-18th c.).
We take existing standards for contemporary and, especially, medieval French into account as much as possible.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the development of big corpora of various periods, it becomes crucial to
standardise linguistic annotation (e.g. lemmas, POS tags, morphological
annotation) to increase the interoperability of the data produced, despite
diachronic variations. In the present paper, we describe both methodologically
(by proposing annotation principles) and technically (by creating the required
training data and the relevant models) the production of a linguistic tagger
for (early) modern French (16-18th c.), taking into account, as far as possible,
existing standards for contemporary and, especially, medieval French.
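As a concrete illustration of what such standardisation yields, here is a minimal sketch (mine, not the authors' released tooling) of a tagger emitting one token per line with lemma, POS and morphology in aligned columns. The tag names echo CATTEX-style conventions used for historical French, but the lookup table and analyses are purely illustrative.

```python
# Minimal sketch of standardised tagger output: form, lemma, POS and
# morphological features as tab-separated columns (CoNLL-like).
# The analyses below are illustrative, not the paper's training data.
ANALYSES = {
    "estoit": ("être", "VERcjg", "MODE=ind|TEMPS=ipf|PERS.=3|NOMB.=s"),
    "roy":    ("roi",  "NOMcom", "NOMB.=s"),
}

def tag(tokens):
    rows = ["form\tlemma\tpos\tmorph"]
    for form in tokens:
        lemma, pos, morph = ANALYSES.get(form.lower(), (form, "UNK", "_"))
        rows.append(f"{form}\t{lemma}\t{pos}\t{morph}")
    return "\n".join(rows)

print(tag(["Le", "roy", "estoit"]))
```

Note that the pre-orthographic forms ("roy", "estoit") receive the same lemmas as their modern counterparts, which is what makes diachronic corpora interoperable.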
Related papers
- MACT: Model-Agnostic Cross-Lingual Training for Discourse Representation Structure Parsing [4.536003573070846]
We introduce a cross-lingual training strategy for semantic representation parsing models.
It exploits the alignments between languages encoded in pre-trained language models.
Experiments show significant improvements in DRS clause and graph parsing in English, German, Italian and Dutch.
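A hedged sketch of the ingredient such cross-lingual transfer relies on: a multilingual pre-trained encoder already places translations close together in representation space. The model choice and mean-pooling below are illustrative assumptions, not the paper's setup.

```python
# Probe the cross-lingual alignment encoded in a pre-trained model by
# comparing mean-pooled sentence vectors of a translation pair.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

def embed(sentence: str) -> torch.Tensor:
    batch = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (1, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)   # ignore padding
    return (hidden * mask).sum(1) / mask.sum(1)    # mean pooling

en = embed("The parser was trained on English data.")
de = embed("Der Parser wurde auf englischen Daten trainiert.")
print(torch.cosine_similarity(en, de).item())  # high if representations align
```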
arXiv Detail & Related papers (2024-06-03T07:02:57Z)
- We're Calling an Intervention: Exploring the Fundamental Hurdles in Adapting Language Models to Nonstandard Text [8.956635443376527]
We present a suite of experiments that allow us to understand the underlying challenges of language model adaptation to nonstandard text.
We do so by designing interventions that approximate several types of linguistic variation and their interactions with existing biases of language models.
Applying our interventions during language model adaptation with varying size and nature of training data, we gain important insights into when knowledge transfer can be successful.
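One way to picture such an intervention, as a hedged sketch: inject synthetic orthographic variation into otherwise standard text and observe how adaptation copes. The substitution table below is an illustrative stand-in for the paper's actual interventions.

```python
# Approximate nonstandard spelling by randomly applying character-level
# variants (here, early-modern-looking substitutions) to standard text.
import random

SUBS = {"s": "ſ", "v": "u", "j": "i"}  # illustrative variant table

def perturb(text: str, rate: float = 0.3, seed: int = 0) -> str:
    rng = random.Random(seed)
    return "".join(
        SUBS[ch] if ch in SUBS and rng.random() < rate else ch
        for ch in text
    )

print(perturb("les vertus sont infinies"))
```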
arXiv Detail & Related papers (2024-04-10T18:56:53Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
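As a hedged formalisation (the notation is mine, not the survey's): given an encoder $f$ and a set $P$ of translation pairs, one common alignment score is the mean similarity of paired representations, $\mathrm{align}(f) = \frac{1}{|P|}\sum_{(x,y)\in P}\cos(f(x), f(y))$, often contrasted with the same statistic over random cross-language pairs.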
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
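A hedged sketch of the variational core such a model needs: a reparameterised latent drawn from a posterior predicted off the encoder state, plus the KL term of the ELBO. The dimensions and the standard-normal prior are illustrative assumptions.

```python
# Reparameterised sentence latent and its KL penalty against N(0, I).
import torch
import torch.nn as nn

class LatentEmbedder(nn.Module):
    def __init__(self, dim: int = 768, z_dim: int = 128):
        super().__init__()
        self.mu = nn.Linear(dim, z_dim)
        self.logvar = nn.Linear(dim, z_dim)

    def forward(self, h):
        # h: encoder output for a sentence in any of the N languages.
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # sample
        return z, mu, logvar

def kl(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions.
    return 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1).mean()
```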
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Metadata Might Make Language Models Better [1.7100280218774935]
Using 19th-century newspapers as a case study, we compare different strategies for inserting temporal, political and geographical information into a Masked Language Model.
We find that showing relevant metadata to a language model has a beneficial impact and may even produce more robust and fairer models.
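A hedged sketch of the simplest insertion strategy one could try (the token format is illustrative, not necessarily the paper's scheme): prepend metadata markers to each training example so the masked LM can condition on them.

```python
# Expose temporal, political and geographical metadata to a masked LM
# by prefixing each example with special marker tokens.
def with_metadata(text: str, year: int, place: str, leaning: str) -> str:
    prefix = f"[YEAR={year}] [PLACE={place}] [POLITICS={leaning}]"
    return f"{prefix} {text}"

print(with_metadata("The debate in Parliament resumed...", 1867, "London", "liberal"))
```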
arXiv Detail & Related papers (2022-11-18T08:29:00Z)
- Improving Temporal Generalization of Pre-trained Language Models with Lexical Semantic Change [28.106524698188675]
Recent research has revealed that neural language models at scale suffer from poor temporal generalization capability.
We propose a simple yet effective lexical-level masking strategy to post-train a converged language model.
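A hedged sketch of what a lexical-level masking strategy can look like (the change scores and sampling scheme are illustrative assumptions): bias mask selection toward words whose usage shifts over time, so post-training spends its budget where drift happens.

```python
# Sample mask positions with probability proportional to an (assumed)
# lexical semantic change score, instead of uniformly.
import random

CHANGE_SCORE = {"virus": 0.9, "remote": 0.8, "bread": 0.1}  # hypothetical

def choose_masks(tokens, budget, rng=random.Random(0)):
    weights = [CHANGE_SCORE.get(t, 0.2) for t in tokens]
    picked = set()
    while len(picked) < min(budget, len(tokens)):
        picked.add(rng.choices(range(len(tokens)), weights=weights, k=1)[0])
    return [("[MASK]" if i in picked else t) for i, t in enumerate(tokens)]

print(choose_masks("working remote during the virus year".split(), budget=2))
```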
arXiv Detail & Related papers (2022-10-31T08:12:41Z)
- Benchmarking Transformers-based models on French Spoken Language Understanding tasks [4.923118300276026]
We benchmark 13 Transformer-based models on two spoken language understanding tasks for French: MEDIA and ATIS-FR.
We show that compact models can match the results of bigger ones while having a considerably lower ecological impact.
arXiv Detail & Related papers (2022-07-19T09:47:08Z)
- Cross-lingual Lifelong Learning [53.06904052325966]
We present a principled Cross-lingual Continual Learning (CCL) evaluation paradigm.
We provide insights into what makes multilingual sequential learning particularly challenging.
The implications of this analysis include a recipe for how to measure and balance different cross-lingual continual learning desiderata.
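A hedged sketch of the bookkeeping such an evaluation paradigm implies (train_on and evaluate are placeholders, not the paper's API): fine-tune on languages in sequence and re-score every language seen so far, yielding a per-language forgetting measure.

```python
# Sequentially adapt a model across languages and track forgetting:
# for each language, best past score minus final score.
def ccl_run(model, languages, train_on, evaluate):
    history = {}
    for i, lang in enumerate(languages):
        train_on(model, lang)
        for seen in languages[: i + 1]:
            history.setdefault(seen, []).append(evaluate(model, seen))
    return {lang: max(scores) - scores[-1] for lang, scores in history.items()}
```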
arXiv Detail & Related papers (2022-05-23T09:25:43Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT)
CsaNMT augments each training instance with an adjacency region intended to cover adequate literal variants expressing the same meaning.
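A hedged guess at what sampling from such an adjacency region can look like (interpolation plus bounded noise is my illustrative choice, not necessarily CsaNMT's exact scheme):

```python
# Draw an augmented representation from the region between a sentence
# encoding and a same-meaning variant, then jitter within a radius.
import torch

def augment(rep: torch.Tensor, variant: torch.Tensor, radius: float = 0.1):
    lam = torch.rand(())                       # random point on the segment
    point = lam * rep + (1 - lam) * variant    # semantic interpolation
    return point + radius * torch.randn_like(point)
```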
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
This helps avoid the degenerate case of predicting masked words conditioned only on context from the same language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
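A hedged sketch of such a plug-in module (sizes and the residual wiring are illustrative): one language's encoder states attend to the paired sentence in another language, so masked words are not predicted from their own language alone.

```python
# Cross-attention block that lets states in one language attend to the
# paired sentence in another language.
import torch
import torch.nn as nn

class CrossLingualBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        ctx, _ = self.attn(query=x, key=other, value=other)
        return self.norm(x + ctx)  # residual + layer norm
```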
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
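A hedged sketch of the idea (the char-GRU composition is an illustrative stand-in for the paper's grounded compositional layer): build output word embeddings from characters on demand, so the output layer's size no longer depends on a fixed training vocabulary.

```python
# Score any candidate word by composing a word embedding from its
# characters at prediction time, instead of indexing a fixed table.
import torch
import torch.nn as nn

class CompositionalOutput(nn.Module):
    def __init__(self, n_chars: int = 256, dim: int = 128):
        super().__init__()
        self.chars = nn.Embedding(n_chars, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def word_embedding(self, word: str) -> torch.Tensor:
        ids = torch.tensor([[min(ord(c), 255) for c in word]])
        _, h = self.rnn(self.chars(ids))
        return h[-1, 0]                     # final hidden state as embedding

    def logit(self, hidden: torch.Tensor, word: str) -> torch.Tensor:
        return hidden @ self.word_embedding(word)  # dot-product score
```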
arXiv Detail & Related papers (2020-09-24T07:21:14Z)