Restoration of Fragmentary Babylonian Texts Using Recurrent Neural Networks
- URL: http://arxiv.org/abs/2003.01912v1
- Date: Wed, 4 Mar 2020 06:36:50 GMT
- Title: Restoration of Fragmentary Babylonian Texts Using Recurrent Neural Networks
- Authors: Ethan Fetaya, Yonatan Lifshitz, Elad Aaron and Shai Gordin
- Abstract summary: The main sources of information regarding ancient Mesopotamian history and culture are clay cuneiform tablets.
Despite being an invaluable resource, many tablets are fragmented, leading to missing information.
In this work we investigate the possibility of assisting scholars and even automatically completing the breaks in ancient Akkadian texts from Achaemenid period Babylonia by modelling the language using recurrent neural networks.
- Score: 14.024892678242379
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The main sources of information regarding ancient Mesopotamian history and
culture are clay cuneiform tablets. Despite being an invaluable resource, many
tablets are fragmented, leading to missing information. Currently these missing
parts are manually completed by experts. In this work we investigate the
possibility of assisting scholars and even automatically completing the breaks
in ancient Akkadian texts from Achaemenid period Babylonia by modelling the
language using recurrent neural networks.
Related papers
- Rejoining fragmented ancient bamboo slips with physics-driven deep learning [77.2197174265539]
WisePanda is a physics-driven deep learning framework designed to rejoin fragmented bamboo slips.
Based on the physics of fracture and material deterioration, WisePanda automatically generates synthetic training data.
Archaeologists using WisePanda have experienced substantial efficiency improvements.
arXiv Detail & Related papers (2025-05-13T14:16:53Z)
- Measuring Non-Adversarial Reproduction of Training Data in Large Language Models [71.55350441396243]
We quantify the overlap between model responses and pretraining data when responding to natural and benign prompts.
We find that up to 15% of the text output by popular conversational language models overlaps with snippets from the Internet.
While appropriate prompting can reduce non-adversarial reproduction on average, we find that mitigating worst-case reproduction of training data requires stronger defenses.
arXiv Detail & Related papers (2024-11-15T14:55:01Z)
- Analysis of Plan-based Retrieval for Grounded Text Generation [78.89478272104739]
Hallucinations occur when a language model is given a generation task outside its parametric knowledge.
A common strategy to address this limitation is to infuse the language models with retrieval mechanisms.
We analyze how planning can be used to guide retrieval to further reduce the frequency of hallucinations.
arXiv Detail & Related papers (2024-08-20T02:19:35Z)
- Lacuna Language Learning: Leveraging RNNs for Ranked Text Completion in Digitized Coptic Manuscripts [8.30703600268965]
We present a bidirectional RNN model for character prediction of Coptic characters in manuscript lacunae.
Our best model performs with 72% accuracy on single character reconstruction, but falls to 37% when reconstructing lacunae of various lengths.
arXiv Detail & Related papers (2024-07-17T01:28:12Z)
- Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction [73.26364649572237]
Oracle Bone Inscriptions is one of the oldest existing forms of writing in the world.
A large number of Oracle Bone Inscriptions (OBI) remain undeciphered, making it one of the global challenges in paleography today.
This paper introduces a novel approach, namely Puzzle Pieces Picker (P$^3$), to decipher these enigmatic characters through radical reconstruction.
arXiv Detail & Related papers (2024-06-05T07:34:39Z)
- Restoring Ancient Ideograph: A Multimodal Multitask Neural Network Approach [11.263700269889654]
This paper proposes a novel Multimodal Multitask Restoring Model (MMRM) to restore ancient texts.
It combines context understanding with residual visual information from damaged ancient artefacts, enabling it to predict damaged characters and generate restored images simultaneously.
arXiv Detail & Related papers (2024-03-11T12:57:28Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- An open dataset for oracle bone script recognition and decipherment [66.35957530824872]
Oracle bone script, one of the earliest known forms of ancient Chinese writing, presents invaluable research materials for scholars studying the humanities and geography of the Shang Dynasty, dating back 3,000 years.
The passage of time has obscured much of their meaning, presenting a significant challenge in deciphering these ancient texts.
With the advent of Artificial Intelligence (AI), employing AI to assist in deciphering Oracle Bone Characters (OBCs) has become a feasible option.
This dataset encompasses 77,064 images of 1,588 individual deciphered characters and 62,989 images of 9,411 undeciphered characters, with a total of 140,
arXiv Detail & Related papers (2024-01-27T09:54:16Z)
- Style Classification of Rabbinic Literature for Detection of Lost Midrash Tanhuma Material [1.933681537640272]
We propose a system for classification of rabbinic literature based on its style.
We show how this method can be applied to uncover lost material from a specific midrash genre.
arXiv Detail & Related papers (2022-11-17T17:45:59Z)
- Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling Approach [8.00388161728995]
We present models which complete missing text given transliterations of ancient Mesopotamian documents.
Due to the tablets' deterioration, scholars often rely on contextual cues to manually fill in missing parts in the text.
arXiv Detail & Related papers (2021-09-09T18:58:14Z)
- MedLatinEpi and MedLatinLit: Two Datasets for the Computational Authorship Analysis of Medieval Latin Texts [72.16295267480838]
We present and make available MedLatinEpi and MedLatinLit, two datasets of medieval Latin texts to be used in research on computational authorship analysis.
MedLatinEpi and MedLatinLit consist of 294 and 30 curated texts, respectively, labelled by author; MedLatinEpi texts are of epistolary nature, while MedLatinLit texts consist of literary comments and treatises about various subjects.
arXiv Detail & Related papers (2020-06-22T14:22:47Z)