Targum -- A Multilingual New Testament Translation Corpus
- URL: http://arxiv.org/abs/2602.09724v1
- Date: Tue, 10 Feb 2026 12:27:57 GMT
- Title: Targum -- A Multilingual New Testament Translation Corpus
- Authors: Maciej Rapacz, Aleksander Smywiński-Pohl
- Abstract summary: We introduce a multilingual corpus of 657 New Testament translations, of which 352 are unique, with unprecedented depth in five languages: English (208 unique versions from 396 total), French (41 from 78), Italian (18 from 33), Polish (30 from 48), and Spanish (55 from 102). Each translation is manually annotated with metadata that maps the text to a standardized identifier for the work, its specific edition, and its year of revision. This canonicalization empowers researchers to define "uniqueness" for their own needs.
- Score: 46.390064640459
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many European languages possess rich biblical translation histories, yet existing corpora - in prioritizing linguistic breadth - often fail to capture this depth. To address this gap, we introduce a multilingual corpus of 657 New Testament translations, of which 352 are unique, with unprecedented depth in five languages: English (208 unique versions from 396 total), French (41 from 78), Italian (18 from 33), Polish (30 from 48), and Spanish (55 from 102). Aggregated from 12 online biblical libraries and one preexisting corpus, each translation is manually annotated with metadata that maps the text to a standardized identifier for the work, its specific edition, and its year of revision. This canonicalization empowers researchers to define "uniqueness" for their own needs: they can perform micro-level analyses on translation families, such as the KJV lineage, or conduct macro-level studies by deduplicating closely related texts. By providing the first resource designed for such flexible, multilevel analysis, our corpus establishes a new benchmark for the quantitative study of translation history.
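The canonicalization idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code or schema: the `Translation` record, its field names, the `deduplicate` helper, and the four toy records are all assumptions made for illustration. Only the per-language counts are taken from the abstract. The sketch shows how the same set of texts yields different "unique" counts depending on whether the key is the full (work, edition, revision year) triple or the work alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    # All field names are hypothetical; the abstract only says that each text
    # is mapped to a work identifier, an edition, and a year of revision.
    source: str         # which library the text was collected from
    language: str
    work_id: str        # standardized identifier for the work
    edition_id: str     # identifier for the specific edition
    revision_year: int


def deduplicate(translations, key):
    """Keep the first translation seen for each distinct value of key(t)."""
    seen = {}
    for t in translations:
        seen.setdefault(key(t), t)
    return list(seen.values())


# Toy records, invented for illustration only (not taken from the corpus).
records = [
    Translation("library_a", "en", "KJV", "KJV-1769", 1769),
    Translation("library_b", "en", "KJV", "KJV-1769", 1769),  # same edition, other source
    Translation("library_a", "en", "KJV", "KJV-1611", 1611),  # same work, other edition
    Translation("library_c", "en", "WORK-B", "WORK-B-1", 2000),
]

# Micro-level view: editions within a translation family stay distinct.
by_edition = deduplicate(
    records, key=lambda t: (t.work_id, t.edition_id, t.revision_year)
)

# Macro-level view: closely related texts collapse into one entry per work.
by_work = deduplicate(records, key=lambda t: t.work_id)

# Per-language (unique, total) counts as stated in the abstract.
counts = {
    "English": (208, 396),
    "French": (41, 78),
    "Italian": (18, 33),
    "Polish": (30, 48),
    "Spanish": (55, 102),
}
unique_total = sum(unique for unique, _ in counts.values())
all_total = sum(total for _, total in counts.values())

print(len(by_edition), len(by_work))  # 3 2
print(unique_total, all_total)        # 352 657
```

The per-language figures sum to the stated totals of 352 unique and 657 overall translations. Which key counts as "unique" is left to the researcher, which is the flexibility the abstract describes.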
Related papers
- Efficacy of ByT5 in Multilingual Translation of Biblical Texts for Underrepresented Languages [3.313876945324241]
This study presents the development and evaluation of a ByT5-based multilingual translation model tailored for translating the Bible into underrepresented languages.
We trained the model to capture the intricate nuances of character-based and morphologically rich languages.
Our results, measured by the BLEU score and supplemented with sample translations, suggest the model can improve accessibility to sacred texts.
arXiv Detail & Related papers (2024-05-22T05:12:35Z)
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
- The eBible Corpus: Data and Model Benchmarks for Bible Translation for Low-Resource Languages [1.4681482563848867]
Bible translation (BT) work is currently underway for over 3000 extremely low resource languages.
We introduce the eBible corpus: a dataset containing 1009 translations of portions of the Bible with data in 833 different languages across 75 language families.
In addition to a BT dataset benchmarking, we introduce model performance benchmarks built on the No Language Left Behind (NLLB) neural machine translation (NMT) models.
arXiv Detail & Related papers (2023-04-19T18:52:49Z)
- CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- HELFI: a Hebrew-Greek-Finnish Parallel Bible Corpus with Cross-Lingual Morpheme Alignment [0.0]
Twenty-five years ago, morphologically aligned Hebrew-Finnish and Greek-Finnish bitexts were constructed manually.
This paper describes a nontrivial editorial process starting from the creation of the original one-purpose database.
It ends with its reconstruction using only freely available text editions and annotations.
arXiv Detail & Related papers (2020-03-16T22:10:35Z)