HELFI: a Hebrew-Greek-Finnish Parallel Bible Corpus with Cross-Lingual
Morpheme Alignment
- URL: http://arxiv.org/abs/2003.07456v1
- Date: Mon, 16 Mar 2020 22:10:35 GMT
- Title: HELFI: a Hebrew-Greek-Finnish Parallel Bible Corpus with Cross-Lingual
Morpheme Alignment
- Authors: Anssi Yli-Jyrä and Josi Purhonen and Matti Liljeqvist and Arto
Antturi and Pekka Nieminen and Kari M. Räntilä and Valtter Luoto
- Abstract summary: Twenty-five years ago, morphologically aligned Hebrew-Finnish and Greek-Finnish bitexts were constructed manually.
This paper describes a nontrivial editorial process starting from the creation of the original one-purpose database.
It ends with its reconstruction using only freely available text editions and annotations.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Twenty-five years ago, morphologically aligned Hebrew-Finnish and
Greek-Finnish bitexts (texts accompanied by a translation) were constructed
manually in order to create an analytical concordance (Luoto et al., 1997) for
a Finnish Bible translation. The creators of the bitexts recently secured the
publisher's permission to release its fine-grained alignment, but the alignment
was still dependent on proprietary, third-party resources such as a copyrighted
text edition and proprietary morphological analyses of the source texts. In
this paper, we describe a nontrivial editorial process starting from the
creation of the original one-purpose database and ending with its
reconstruction using only freely available text editions and annotations. This
process produced an openly available dataset that contains (i) the source texts
and their translations, (ii) the morphological analyses, and (iii) the
cross-lingual morpheme alignments.
Related papers
- Automatic Translation Alignment Pipeline for Multilingual Digital Editions of Literary Works [0.0]
This paper investigates the application of translation alignment algorithms in the creation of a Multilingual Digital Edition (MDE) of Alessandro Manzoni's Italian novel "I promessi sposi".
We identify key requirements for the MDE to improve both the reader experience and support for translation studies.
We propose new metrics for evaluating the alignment of literary translations and suggest visualization techniques for future analysis.
arXiv Detail & Related papers (2024-10-17T06:21:38Z)
- X-PARADE: Cross-Lingual Textual Entailment and Information Divergence across Paragraphs [55.80189506270598]
X-PARADE is the first cross-lingual dataset of paragraph-level information divergences.
Annotators label a paragraph in a target language at the span level and evaluate it with respect to a corresponding paragraph in a source language.
Aligned paragraphs are sourced from Wikipedia pages in different languages.
arXiv Detail & Related papers (2023-09-16T04:34:55Z)
- Computer-Aided Modelling of the Bilingual Word Indices to the Ninth-Century Uchitel'noe evangelie [0.0]
We show how we model various types of asymmetric translation correlates and the variability resulting from the pluralism of sources.
Our approach is designed with generalisation in mind and is intended to be applicable also for other translations from Greek into Old Church Slavonic.
arXiv Detail & Related papers (2022-10-25T10:16:39Z)
- Example-Based Machine Translation from Text to a Hierarchical Representation of Sign Language [1.3999481573773074]
This article presents an original method for Text-to-Sign Translation.
It compensates data scarcity using a domain-specific parallel corpus of alignments between text and hierarchical formal descriptions of Sign Language videos in AZee.
Based on the detection of similarities present in the source text, the proposed algorithm exploits matches and substitutions of aligned segments to build multiple candidate translations.
The resulting translations are in the form of AZee expressions, designed to be used as input to avatar systems.
arXiv Detail & Related papers (2022-05-06T15:48:43Z)
- BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation [53.55009917938002]
We propose to refine the mined bitexts via automatic editing.
Experiments demonstrate that our approach successfully improves the quality of CCMatrix mined bitext for 5 low-resource language-pairs and 10 translation directions by up to 8 BLEU points.
arXiv Detail & Related papers (2021-11-12T16:00:39Z)
- InvBERT: Text Reconstruction from Contextualized Embeddings used for Derived Text Formats of Literary Works [1.6058099298620423]
Digital Humanities and Computational Literary Studies apply text mining methods to investigate literature.
Due to copyright restrictions, the availability of relevant digitized literary works is limited.
Our attempts to invert BERT suggest that publishing parts of the encoder together with the contextualized embeddings is critical.
arXiv Detail & Related papers (2021-09-21T11:35:41Z)
- Text Editing by Command [82.50904226312451]
A prevailing paradigm in neural text generation is one-shot generation, where text is produced in a single step.
We address this limitation with an interactive text generation setting in which the user interacts with the system by issuing commands to edit existing text.
We show that our Interactive Editor, a transformer-based model trained on this dataset, outperforms baselines and obtains positive results in both automatic and human evaluations.
arXiv Detail & Related papers (2020-10-24T08:00:30Z)
- A High-Quality Multilingual Dataset for Structured Documentation Translation [101.41835967142521]
This paper presents a high-quality multilingual dataset for the documentation domain.
We collect XML-structured parallel text segments from the online documentation for an enterprise software platform.
arXiv Detail & Related papers (2020-06-24T02:08:44Z)
- MedLatinEpi and MedLatinLit: Two Datasets for the Computational Authorship Analysis of Medieval Latin Texts [72.16295267480838]
We present and make available MedLatinEpi and MedLatinLit, two datasets of medieval Latin texts to be used in research on computational authorship analysis.
MedLatinEpi and MedLatinLit consist of 294 and 30 curated texts, respectively, labelled by author; MedLatinEpi texts are of epistolary nature, while MedLatinLit texts consist of literary comments and treatises about various subjects.
arXiv Detail & Related papers (2020-06-22T14:22:47Z)
- Building a Hebrew Semantic Role Labeling Lexical Resource from Parallel Movie Subtitles [4.089055556130724]
We present a semantic role labeling resource for Hebrew built semi-automatically through annotation projection from English.
This corpus is derived from the multilingual OpenSubtitles dataset and includes short informal sentences.
We provide a fully annotated version of the data including morphological analysis, dependency syntax and semantic role labeling in both FrameNet and PropBank styles.
We train a neural SRL model on this Hebrew resource exploiting the pre-trained multilingual BERT transformer model, and provide the first available baseline model for Hebrew SRL as a reference point.
arXiv Detail & Related papers (2020-05-17T10:03:42Z)
- Learning Contextualized Sentence Representations for Document-Level Neural Machine Translation [59.191079800436114]
Document-level machine translation incorporates inter-sentential dependencies into the translation of a source sentence.
We propose a new framework to model cross-sentence dependencies by training neural machine translation (NMT) to predict both the target translation and surrounding sentences of a source sentence.
arXiv Detail & Related papers (2020-03-30T03:38:01Z)