SentAlign: Accurate and Scalable Sentence Alignment
- URL: http://arxiv.org/abs/2311.08982v1
- Date: Wed, 15 Nov 2023 14:15:41 GMT
- Title: SentAlign: Accurate and Scalable Sentence Alignment
- Authors: Steinþór Steingrímsson, Hrafn Loftsson, Andy Way
- Abstract summary: SentAlign is an accurate sentence alignment tool designed to handle very large parallel document pairs.
The alignment algorithm evaluates all possible alignment paths in fairly large documents of thousands of sentences and uses a divide-and-conquer approach to align documents containing tens of thousands of sentences.
- Score: 4.363828136730248
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present SentAlign, an accurate sentence alignment tool designed to handle
very large parallel document pairs. Given user-defined parameters, the
alignment algorithm evaluates all possible alignment paths in fairly large
documents of thousands of sentences and uses a divide-and-conquer approach to
align documents containing tens of thousands of sentences. The scoring function
is based on LaBSE bilingual sentence representations. SentAlign outperforms
five other sentence alignment tools when evaluated on two different evaluation
sets, German-French and English-Icelandic, and on a downstream machine
translation task.
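The abstract's alignment procedure can be sketched as a dynamic program that scores every monotonic path over 1-1 and 1-0/0-1 "beads" using sentence-embedding similarity. This is a minimal illustration only, not SentAlign's implementation: it assumes precomputed sentence embeddings (SentAlign uses LaBSE; tiny hand-made vectors stand in here) and a hypothetical flat skip penalty.

```python
# Sketch of embedding-scored sentence alignment via dynamic programming.
# Assumptions (not from the paper): precomputed embeddings, flat skip penalty.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def align(src_embs, tgt_embs, skip_penalty=-0.5):
    """Best-scoring monotonic alignment path over 1-1 and 1-0/0-1 beads,
    scored with embedding similarity instead of sentence-length ratios."""
    m, n = len(src_embs), len(tgt_embs)
    score = [[-math.inf] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    score[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if score[i][j] == -math.inf:
                continue
            # 1-1 bead: link source sentence i with target sentence j
            if i < m and j < n:
                s = score[i][j] + cosine(src_embs[i], tgt_embs[j])
                if s > score[i + 1][j + 1]:
                    score[i + 1][j + 1] = s
                    back[i + 1][j + 1] = (i, j, "1-1")
            # 1-0 / 0-1 beads: leave a sentence unaligned, at a penalty
            if i < m and score[i][j] + skip_penalty > score[i + 1][j]:
                score[i + 1][j] = score[i][j] + skip_penalty
                back[i + 1][j] = (i, j, "1-0")
            if j < n and score[i][j] + skip_penalty > score[i][j + 1]:
                score[i][j + 1] = score[i][j] + skip_penalty
                back[i][j + 1] = (i, j, "0-1")
    # Trace the best path back from the full-document corner
    path, (i, j) = [], (m, n)
    while back[i][j]:
        pi, pj, bead = back[i][j]
        if bead == "1-1":
            path.append((pi, pj))
        i, j = pi, pj
    return path[::-1]
```

Evaluating all paths this way is quadratic in document length, which is why the tool falls back to divide-and-conquer for documents of tens of thousands of sentences.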
Related papers
- How Transliterations Improve Crosslingual Alignment [48.929677368744606]
Recent studies have shown that post-aligning multilingual pretrained language models (mPLMs) using alignment objectives can improve crosslingual alignment.
This paper attempts to explicitly evaluate the crosslingual alignment and identify the key elements in transliteration-based approaches that contribute to better performance.
arXiv Detail & Related papers (2024-09-25T20:05:45Z)
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- DEPLAIN: A German Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification [1.5223905439199599]
This paper presents DEplain, a new dataset of parallel, professionally written and manually aligned simplifications in plain German.
We show that using DEplain to train a transformer-based seq2seq text simplification model can achieve promising results.
We make available the corpus, the adapted alignment methods for German, the web harvester and the trained models here.
arXiv Detail & Related papers (2023-05-30T11:07:46Z)
- Dual-Alignment Pre-training for Cross-lingual Sentence Embedding [79.98111074307657]
We propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding.
We introduce a novel representation translation learning (RTL) task, where the model learns to use one-side contextualized token representation to reconstruct its translation counterpart.
Our approach can significantly improve sentence embedding.
arXiv Detail & Related papers (2023-05-16T03:53:30Z)
- A New Aligned Simple German Corpus [2.7981463795578927]
We present a new sentence-aligned monolingual corpus for Simple German -- German.
It contains multiple document-aligned sources which we have aligned using automatic sentence-alignment methods.
The quality of our sentence alignments, as measured by F1-score, surpasses previous work.
arXiv Detail & Related papers (2022-09-02T15:14:04Z)
- EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation [63.88541605363555]
"Extract and Generate" (EAG) is a two-step approach to construct large-scale and high-quality multi-way aligned corpus from bilingual data.
We first extract candidate aligned examples by pairing the bilingual examples from different language pairs with highly similar source or target sentences.
We then generate the final aligned examples from the candidates with a well-trained generation model.
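The extraction step just described can be illustrated with a toy function. This is a simplified stand-in, not EAG's method: where the paper pairs examples with highly similar shared-language sides, the sketch below uses exact match after trivial normalization, and the corpus format is invented for illustration.

```python
# Hedged sketch of EAG-style candidate extraction: pair bilingual examples
# from two language pairs whose shared (English) sides coincide. Exact match
# after lowercasing stands in for the paper's similarity-based pairing.
def extract_candidates(en_de, en_fr):
    """en_de: list of (english, german) pairs; en_fr: list of (english, french)
    pairs. Returns candidate (german, english, french) multi-way examples."""
    fr_by_en = {}
    for en, fr in en_fr:
        fr_by_en.setdefault(en.lower().strip(), fr)
    return [(de, en, fr_by_en[en.lower().strip()])
            for en, de in en_de
            if en.lower().strip() in fr_by_en]
```

In the paper, these candidates are then refined by a trained generation model rather than taken as-is.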
arXiv Detail & Related papers (2022-03-04T08:21:27Z)
- Using Optimal Transport as Alignment Objective for fine-tuning Multilingual Contextualized Embeddings [7.026476782041066]
We propose using Optimal Transport (OT) as an alignment objective during fine-tuning to improve multilingual contextualized representations.
This approach does not require word-alignment pairs prior to fine-tuning and instead learns the word alignments within context in an unsupervised manner.
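The core idea can be sketched in miniature: treat alignment between two token sets as an entropy-regularized optimal transport problem and solve it with Sinkhorn iterations, reading aligned pairs off the transport plan. This is an illustrative toy with uniform marginals and a hand-picked cost matrix, not the paper's training objective or implementation.

```python
# Hedged sketch: Sinkhorn iterations for entropy-regularized optimal
# transport. Large entries of the returned plan indicate aligned pairs.
import math

def sinkhorn(cost, reg=0.1, iters=200):
    """cost: m x n matrix of transport costs (e.g. embedding distances).
    Returns the transport plan under uniform source/target marginals."""
    m, n = len(cost), len(cost[0])
    K = [[math.exp(-c / reg) for c in row] for row in cost]  # Gibbs kernel
    u, v = [1.0] * m, [1.0] * n
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals
        u = [(1.0 / m) / sum(K[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [(1.0 / n) / sum(K[i][j] * u[i] for i in range(m)) for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(m)]
```

With a low-cost diagonal, the plan concentrates its mass on the diagonal, i.e. the mutually nearest pairs; in the paper this signal drives a fine-tuning objective rather than a one-off decoding step.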
arXiv Detail & Related papers (2021-10-06T16:13:45Z)
- Word Alignment by Fine-tuning Embeddings on Parallel Corpora [96.28608163701055]
Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs.
Recently, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data.
In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs and fine-tuning them on parallel text with objectives designed to improve alignment quality.
arXiv Detail & Related papers (2021-01-20T17:54:47Z)
- Subword Sampling for Low Resource Word Alignment [4.663577299263155]
We propose subword sampling-based alignment of text units.
We show that the subword sampling method consistently outperforms word-level alignment on six language pairs.
arXiv Detail & Related papers (2020-12-21T19:47:04Z)
- Multilingual Alignment of Contextual Word Representations [49.42244463346612]
After alignment, BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model.
We introduce a contextual version of word retrieval and show that it correlates well with downstream zero-shot transfer.
These results support contextual alignment as a useful concept for understanding large multilingual pre-trained models.
arXiv Detail & Related papers (2020-02-10T03:27:21Z)
- Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover's Distance [8.395430195053061]
Document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other.
We develop an unsupervised scoring function that leverages cross-lingual sentence embeddings to compute the semantic distance between documents in different languages.
These semantic distances are then used to guide a document alignment algorithm to properly pair cross-lingual web documents across a variety of low, mid, and high-resource language pairs.
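A cheap stand-in for such a scoring function can be sketched as a relaxed mover's distance over sentence embeddings: each sentence "moves" to its nearest counterpart in the other document. This approximation is a common relaxation of earth-mover distances, not the paper's exact formulation, and the embeddings here are toy vectors rather than trained cross-lingual representations.

```python
# Hedged sketch: relaxed sentence-mover's distance between two documents,
# each given as a list of sentence-embedding vectors.
import math

def cosine_dist(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / norm if norm else 0.0)

def doc_distance(doc_a, doc_b):
    """Symmetric relaxed mover's distance: average nearest-neighbor cosine
    distance in each direction, keeping the tighter one-sided bound."""
    a_to_b = sum(min(cosine_dist(u, v) for v in doc_b) for u in doc_a) / len(doc_a)
    b_to_a = sum(min(cosine_dist(u, v) for u in doc_a) for v in doc_b) / len(doc_b)
    return max(a_to_b, b_to_a)
```

Pairs of documents with small distances under such a function are then proposed as cross-lingual document matches.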
arXiv Detail & Related papers (2020-01-31T05:14:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.