Latin BERT: A Contextual Language Model for Classical Philology
- URL: http://arxiv.org/abs/2009.10053v1
- Date: Mon, 21 Sep 2020 17:47:44 GMT
- Title: Latin BERT: A Contextual Language Model for Classical Philology
- Authors: David Bamman and Patrick J. Burns
- Abstract summary: We present Latin BERT, a contextual language model for the Latin language.
It was trained on 642.7 million words from a variety of sources spanning the Classical era to the 21st century.
- Score: 7.513100214864645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Latin BERT, a contextual language model for the Latin language,
trained on 642.7 million words from a variety of sources spanning the Classical
era to the 21st century. In a series of case studies, we illustrate the
affordances of this language-specific model both for work in natural language
processing for Latin and in using computational methods for traditional
scholarship: we show that Latin BERT achieves a new state of the art for
part-of-speech tagging on all three Universal Dependency datasets for Latin and
can be used for predicting missing text (including critical emendations); we
create a new dataset for assessing word sense disambiguation for Latin and
demonstrate that Latin BERT outperforms static word embeddings; and we show
that it can be used for semantically-informed search by querying contextual
nearest neighbors. We publicly release trained models to help drive future work
in this space.
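The abstract highlights two uses that are easy to picture in code: filling a masked token (for lacunae or candidate emendations) and extracting contextual vectors for nearest-neighbor search. The sketch below is not the authors' released tooling; it assumes a Hugging Face-compatible copy of the Latin BERT weights at the placeholder path "path/to/latin-bert" (the public checkpoint may require conversion), and the helper functions and example sentences are purely illustrative.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

MODEL_PATH = "path/to/latin-bert"  # placeholder; point at a converted checkpoint

tokenizer = BertTokenizerFast.from_pretrained(MODEL_PATH)
model = BertForMaskedLM.from_pretrained(MODEL_PATH)
model.eval()

def predict_missing(text_with_mask: str, top_k: int = 5):
    """Rank candidate fillers for a [MASK] token, e.g. a lacuna or emendation site."""
    inputs = tokenizer(text_with_mask, return_tensors="pt")
    mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits[0, mask_positions[0]].softmax(dim=-1)
    top = probs.topk(top_k)
    return [(tokenizer.decode([int(i)]), float(p)) for i, p in zip(top.indices, top.values)]

def contextual_vector(sentence: str, target: str) -> torch.Tensor:
    """Hidden state of the first subtoken of `target` in `sentence`; a corpus of
    such vectors supports nearest-neighbor search over words in context."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model.bert(**inputs).last_hidden_state[0]
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]
    ids = inputs["input_ids"][0].tolist()
    for i in range(len(ids) - len(target_ids) + 1):
        if ids[i:i + len(target_ids)] == target_ids:
            return hidden[i]
    raise ValueError(f"{target!r} not found in the tokenised sentence")

# Usage: candidate restorations for a masked word, and a vector for one word in context.
print(predict_missing("in principio erat [MASK], et verbum erat apud deum"))
query = contextual_vector("in principio erat verbum", "verbum")
```

Nearest-neighbor search then amounts to collecting such contextual vectors over a corpus and ranking them by cosine similarity against a query vector.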
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- LiMe: a Latin Corpus of Late Medieval Criminal Sentences [39.26357402982764]
We present the LiMe dataset, a corpus of 325 documents extracted from a series of medieval manuscripts called Libri sententiarum potestatis Mediolani.
arXiv Detail & Related papers (2024-04-19T12:06:28Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms plain-text pre-training while using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Multilingual Jointly Trained Acoustic and Written Word Embeddings [22.63696520064212]
We extend jointly trained acoustic and written word embeddings to multiple low-resource languages.
We jointly train an acoustic word embedding (AWE) model and an acoustically grounded word embedding (AGWE) model, using phonetically transcribed data from multiple languages.
The pre-trained models can then be used for unseen zero-resource languages, or fine-tuned on data from low-resource languages.
arXiv Detail & Related papers (2020-06-24T19:16:02Z)
- The Frankfurt Latin Lexicon: From Morphological Expansion and Word Embeddings to SemioGraphs [97.8648124629697]
The article argues for a more comprehensive understanding of lemmatization, encompassing classical machine learning as well as intellectual post-corrections and, in particular, human interpretation processes based on graph representations of the underlying lexical resources.
arXiv Detail & Related papers (2020-05-21T17:16:53Z)
- Phonetic and Visual Priors for Decipherment of Informal Romanization [37.77170643560608]
We propose a noisy-channel WFST cascade model for deciphering the original non-Latin script from observed romanized text.
We train our model directly on romanized data from two languages: Egyptian Arabic and Russian.
We demonstrate that adding inductive bias through phonetic and visual priors on character mappings substantially improves the model's performance on both languages.
arXiv Detail & Related papers (2020-05-05T21:57:27Z)
- A Survey on Contextual Embeddings [48.04732268018772]
Contextual embeddings assign each word a representation based on its context, capturing uses of words across varied contexts and encoding knowledge that transfers across languages.
We review existing contextual embedding models, cross-lingual polyglot pre-training, the application of contextual embeddings in downstream tasks, model compression, and model analyses.
arXiv Detail & Related papers (2020-03-16T15:22:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.