Exploring Large Language Models for Classical Philology
- URL: http://arxiv.org/abs/2305.13698v1
- Date: Tue, 23 May 2023 05:21:02 GMT
- Title: Exploring Large Language Models for Classical Philology
- Authors: Frederick Riemenschneider and Anette Frank
- Abstract summary: We create four language models for Ancient Greek that vary along two dimensions to study their versatility for tasks of interest for Classical languages.
We evaluate all models on morphological and syntactic tasks, including lemmatization.
Results show that our models provide significant improvements over the SoTA.
- Score: 17.856304057963776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in NLP have led to the creation of powerful language models
for many languages including Ancient Greek and Latin. While prior work on
Classical languages unanimously uses BERT, in this work we create four language
models for Ancient Greek that vary along two dimensions to study their
versatility for tasks of interest for Classical languages: we explore (i)
encoder-only and encoder-decoder architectures using RoBERTa and T5 as strong
model types, and create for each of them (ii) a monolingual Ancient Greek and a
multilingual instance that includes Latin and English. We evaluate all models
on morphological and syntactic tasks, including lemmatization, which
demonstrates the added value of T5's decoding abilities. We further define two
probing tasks to investigate the knowledge acquired by models pre-trained on
Classical texts. Our experiments provide the first benchmarking analysis of
existing models of Ancient Greek. Results show that our models provide
significant improvements over the SoTA. The systematic analysis of model types
can inform future research in designing language models for Classical
languages, including the development of novel generative tasks. We make all our
models available as community resources, along with a large curated
pre-training corpus for Ancient Greek, to support the creation of a larger,
comparable model zoo for Classical Philology. Our models and resources are
available at https://github.com/Heidelberg-NLP/ancient-language-models.
Related papers
- Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining [4.38070902806635]
We set up a benchmark for the Croatian, Serbian, Bosnian, and Montenegrin languages.
We show that comparable performance to dedicated from-scratch models can be obtained by additionally pretraining available multilingual models.
We also show that neighboring languages, in our case Slovenian, can be included in the additional pretraining with little to no loss in the performance of the final model.
arXiv Detail & Related papers (2024-04-08T11:55:44Z) - Formal Aspects of Language Modeling [74.16212987886013]
Large language models have become one of the most commonly deployed NLP inventions.
These notes are the accompaniment to the theoretical portion of the ETH Zürich course on large language models.
arXiv Detail & Related papers (2023-11-07T20:21:42Z) - Evaluating Large Language Models on Controlled Generation Tasks [92.64781370921486]
We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities.
After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing where large language models fall behind, are comparable to, or exceed the abilities of smaller models.
arXiv Detail & Related papers (2023-10-23T03:48:24Z) - Qwen Technical Report [132.54304067403922]
We introduce Qwen, the first installment of our large language model series.
Qwen comprises the base pretrained language models, and Qwen-Chat the chat models finetuned with human alignment techniques.
We have also developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat.
arXiv Detail & Related papers (2023-09-28T17:07:49Z) - Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge
Distillation [0.0]
We use a multilingual knowledge distillation approach to train BERT models to produce sentence embeddings for Ancient Greek text.
We build a parallel sentence dataset using a sentence-embedding alignment method to align Ancient Greek documents with English translations.
We evaluate our models on translation search, semantic similarity, and semantic retrieval tasks.
arXiv Detail & Related papers (2023-08-24T23:38:44Z) - GujiBERT and GujiGPT: Construction of Intelligent Information Processing
Foundation Language Models for Ancient Texts [11.289265479095956]
GujiBERT and GujiGPT are foundation language models specifically designed for intelligent information processing of ancient texts.
These models have been trained on an extensive dataset that encompasses both simplified and traditional Chinese characters.
These models have exhibited exceptional performance across a range of validation tasks using publicly available datasets.
arXiv Detail & Related papers (2023-07-11T15:44:01Z) - GreekBART: The First Pretrained Greek Sequence-to-Sequence Model [13.429669368275318]
We introduce GreekBART, the first Seq2Seq model based on the BART-base architecture and pretrained on a large-scale Greek corpus.
We evaluate and compare GreekBART against BART-random, Greek-BERT, and XLM-R on a variety of discriminative tasks.
arXiv Detail & Related papers (2023-04-03T10:48:51Z) - Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z) - Language Models are General-Purpose Interfaces [109.45478241369655]
We propose to use language models as a general-purpose interface to various foundation models.
A collection of pretrained encoders perceives diverse modalities (such as vision and language).
We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders.
arXiv Detail & Related papers (2022-06-13T17:34:22Z) - Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z) - Are Multilingual Models the Best Choice for Moderately Under-resourced
Languages? A Comprehensive Assessment for Catalan [0.05277024349608833]
This work focuses on Catalan, with the aim of exploring to what extent a medium-sized monolingual language model is competitive with state-of-the-art large multilingual models.
We build a clean, high-quality textual Catalan corpus (CaText), train a Transformer-based language model for Catalan (BERTa), and devise a thorough evaluation in a diversity of settings.
The result is a new benchmark, the Catalan Language Understanding Benchmark (CLUB), which we publish as an open resource.
arXiv Detail & Related papers (2021-07-16T13:52:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.