Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling
Approach
- URL: http://arxiv.org/abs/2109.04513v1
- Date: Thu, 9 Sep 2021 18:58:14 GMT
- Title: Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling
Approach
- Authors: Koren Lazar, Benny Saret, Asaf Yehudai, Wayne Horowitz, Nathan
Wasserman, Gabriel Stanovsky
- Abstract summary: We present models which complete missing text given transliterations of ancient Mesopotamian documents.
Due to the tablets' deterioration, scholars often rely on contextual cues to manually fill in missing parts in the text.
- Score: 8.00388161728995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present models which complete missing text given transliterations of
ancient Mesopotamian documents, originally written on cuneiform clay tablets
(2500 BCE - 100 CE). Because of the tablets' deterioration, scholars often rely
on contextual cues to manually fill in the missing parts of a text, a subjective
and time-consuming process. We observe that this challenge can be formulated as
a masked language modelling task, which is otherwise used mostly as a pretraining
objective for contextualized language models. Building on this formulation, we
develop several architectures focusing on the Akkadian language, the lingua
franca of the time. We find that despite data scarcity (1M tokens) we can achieve
state-of-the-art performance on missing-token prediction (89% hit@5) using a
greedy decoding scheme and pretraining on data from other languages and different
time periods. Finally, we conduct human evaluations showing the applicability of
our models in assisting experts in transcribing texts in extinct languages.
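The formulation above maps directly onto the standard fill-mask interface of BERT-style models. A minimal sketch, assuming a generic multilingual checkpoint (bert-base-multilingual-cased) as a stand-in for the paper's Akkadian-focused models, with an invented transliterated line and gold token used purely for illustration:

```python
# Minimal sketch: gap filling framed as masked language modelling.
# The checkpoint, the transliterated line, and the gold token are
# illustrative assumptions, not the authors' released model or data.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

# A damaged sign in the transliteration becomes the model's mask token.
line = f"a-na be-li2-ia {fill.tokenizer.mask_token} um-ma"
predictions = fill(line, top_k=5)

# hit@5: is the gold token among the five highest-scoring predictions?
gold = "qi2-bi2-ma"
hit_at_5 = any(p["token_str"].strip() == gold for p in predictions)
print([p["token_str"] for p in predictions], hit_at_5)
```

The paper's greedy decoding scheme goes further, filling spans of more than one missing token by predicting one mask at a time; the single-mask call above only illustrates the basic formulation and a hit@5-style check.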
Related papers
- Few-Shot Detection of Machine-Generated Text using Style Representations [4.326503887981912]
Language models that convincingly mimic human writing pose a significant risk of abuse.
We propose to leverage representations of writing style estimated from human-authored text.
We find that features effective at distinguishing among human authors are also effective at distinguishing human from machine authors.
arXiv Detail & Related papers (2024-01-12T17:26:51Z)
- Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation [82.5217996570387]
We adapt a pre-trained language model for auto-regressive text-to-image generation.
We find that pre-trained language models offer limited help.
arXiv Detail & Related papers (2023-11-27T07:19:26Z)
- Crosslingual Structural Priming and the Pre-Training Dynamics of Bilingual Language Models [6.845954748361076]
We use structural priming to test for abstract grammatical representations with causal effects on model outputs.
We extend the approach to a Dutch-English bilingual setting, and we evaluate a Dutch-English language model during pre-training.
We find that crosslingual structural priming effects emerge early after exposure to the second language, with less than 1M tokens of data in that language.
arXiv Detail & Related papers (2023-10-11T22:57:03Z)
- Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation [0.0]
We use a multilingual knowledge distillation approach to train BERT models to produce sentence embeddings for Ancient Greek text.
We build a parallel sentence dataset using a sentence-embedding alignment method to align Ancient Greek documents with English translations.
We evaluate our models on translation search, semantic similarity, and semantic retrieval tasks.
arXiv Detail & Related papers (2023-08-24T23:38:44Z)
- Fine-Tashkeel: Finetuning Byte-Level Models for Accurate Arabic Text Diacritization [10.342180619706724]
We finetune token-free pre-trained multilingual models to learn to predict and insert missing diacritics in Arabic text.
We show that we can achieve state-of-the-art results on the diacritization task with a minimal amount of training and no feature engineering.
arXiv Detail & Related papers (2023-03-25T23:41:33Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST), which translates speech from one language into another.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Supporting Undotted Arabic with Pre-trained Language Models [0.0]
We study the effect of applying pre-trained Arabic language models on "undotted" Arabic texts.
We suggest several ways of supporting undotted texts with pre-trained models, without additional training, and measure their performance on two Arabic natural-language-processing tasks.
arXiv Detail & Related papers (2021-11-18T16:47:56Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM that operates on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT.
We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers.
We show that our model can bring an average increase of 1.5% under the 12-layer setting.
arXiv Detail & Related papers (2021-04-15T02:36:49Z)
- Enabling Language Models to Fill in the Blanks [81.59381915581892]
We present a simple approach for text infilling, the task of predicting missing spans of text at any position in a document.
We train (or fine-tune) off-the-shelf language models on sequences containing the concatenation of artificially-masked text and the text which was masked.
We show that this approach, which we call infilling by language modeling, can enable LMs to infill entire sentences effectively on three different domains: short stories, scientific abstracts, and lyrics.
arXiv Detail & Related papers (2020-05-11T18:00:03Z)
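The last entry above, Enabling Language Models to Fill in the Blanks, is the one closest in spirit to the Akkadian gap-filling task. A minimal sketch of the kind of training sequence it describes, using placeholder special tokens rather than that paper's exact vocabulary:

```python
# Minimal sketch of an infilling-by-language-modeling training example:
# one span is artificially masked and then appended after a separator, so
# an ordinary left-to-right LM learns to generate the missing span.
# "[blank]", "[sep]", and "[answer]" are placeholder tokens, not the
# original paper's vocabulary.
import random

def make_infilling_example(tokens, rng=random):
    i = rng.randrange(len(tokens))               # start of the masked span
    j = min(len(tokens), i + rng.randint(1, 3))  # mask one to three tokens
    masked = tokens[:i] + ["[blank]"] + tokens[j:]
    return " ".join(masked) + " [sep] " + " ".join(tokens[i:j]) + " [answer]"

print(make_infilling_example("she studied the broken clay tablet".split()))
```

At inference time, such a model is given only the text up to the separator and generates the missing span, which can then be substituted back into the blank.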
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.