Graph Algorithms for Multiparallel Word Alignment
- URL: http://arxiv.org/abs/2109.06283v1
- Date: Mon, 13 Sep 2021 19:40:29 GMT
- Title: Graph Algorithms for Multiparallel Word Alignment
- Authors: Ayyoob Imani, Masoud Jalili Sabet, Lütfi Kerem Şenel, Philipp Dufter, François Yvon, Hinrich Schütze
- Abstract summary: In this work, we exploit the multiparallelity of corpora by representing an initial set of bilingual alignments as a graph.
We present two graph algorithms for edge prediction: one inspired by recommender systems and one based on network link prediction.
- Score: 2.5200727733264663
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the advent of end-to-end deep learning approaches in machine
translation, interest in word alignments initially decreased; however, they
have again become a focus of research more recently. Alignments are useful for
typological research, transferring formatting like markup to translated texts,
and can be used in the decoding of machine translation systems. At the same
time, massively multilingual processing is becoming an important NLP scenario,
and pretrained language and machine translation models that are truly
multilingual are proposed. However, most alignment algorithms rely on bitexts
only and do not leverage the fact that many parallel corpora are multiparallel.
In this work, we exploit the multiparallelity of corpora by representing an
initial set of bilingual alignments as a graph and then predicting additional
edges in the graph. We present two graph algorithms for edge prediction: one
inspired by recommender systems and one based on network link prediction. Our
experimental results show absolute improvements in $F_1$ of up to 28% over the
baseline bilingual word aligner on different datasets.
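The core idea lends itself to a small illustration: treat every (language, token) pair as a graph node, add edges from an initial bilingual aligner, and score absent cross-language edges with a standard link-prediction heuristic. The sketch below applies Adamic-Adar to a toy alignment graph; it is only a rough analogue of the paper's approach, and the input format and variable names are assumptions.

```python
# Minimal sketch (not the paper's exact method): score missing alignment edges
# in a multiparallel alignment graph with the Adamic-Adar link-prediction
# heuristic. Nodes are (language, token) pairs; initial edges come from a
# bilingual word aligner run on each language pair.
import math
from collections import defaultdict
from itertools import combinations

# Toy initial alignments for one multiparallel sentence (assumed input format).
initial_edges = [
    (("en", "house"), ("de", "Haus")),
    (("en", "house"), ("fr", "maison")),
    (("de", "Haus"), ("es", "casa")),
    (("fr", "maison"), ("es", "casa")),
]

# Build an undirected adjacency structure.
adj = defaultdict(set)
for u, v in initial_edges:
    adj[u].add(v)
    adj[v].add(u)

def adamic_adar(u, v):
    """Sum of 1/log(degree) over common neighbours of u and v."""
    common = adj[u] & adj[v]
    return sum(1.0 / math.log(len(adj[w])) for w in common if len(adj[w]) > 1)

# Score every cross-language node pair that is not yet connected.
candidates = []
for u, v in combinations(adj, 2):
    if u[0] != v[0] and v not in adj[u]:
        score = adamic_adar(u, v)
        if score > 0:
            candidates.append((score, u, v))

for score, u, v in sorted(candidates, reverse=True):
    print(f"{u} -- {v}: {score:.3f}")
```

On this toy graph the highest-scoring unconnected pairs are the transitive ones (en-es and de-fr), which is exactly the kind of edge a multiparallel aligner can recover but a purely bilingual one cannot.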
Related papers
- How Transliterations Improve Crosslingual Alignment [48.929677368744606]
Recent studies have shown that post-aligning multilingual pretrained language models (mPLMs) using alignment objectives can improve crosslingual alignment.
This paper attempts to explicitly evaluate the crosslingual alignment and identify the key elements in transliteration-based approaches that contribute to better performance.
arXiv Detail & Related papers (2024-09-25T20:05:45Z)
- Dual-Alignment Pre-training for Cross-lingual Sentence Embedding [79.98111074307657]
We propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding.
We introduce a novel representation translation learning (RTL) task, where the model learns to use one-side contextualized token representation to reconstruct its translation counterpart.
Our approach can significantly improve sentence embeddings.
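A hedged sketch of what a reconstruction-style objective in the spirit of RTL could look like (the linear projection, array shapes, and mean-squared-error loss are all assumptions, not the paper's implementation):

```python
# Illustrative only: a reconstruction-style loss in the spirit of the RTL task,
# not the paper's actual objective. Shapes and the MSE choice are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 8                       # hidden size (toy)
n_pairs = 5                 # number of aligned token pairs

src = rng.normal(size=(n_pairs, d))   # source-side contextualized token vectors
tgt = rng.normal(size=(n_pairs, d))   # their translation counterparts
W = rng.normal(size=(d, d)) * 0.1     # learnable projection (toy initialization)

pred = src @ W                        # reconstruct target vectors from the source side
loss = np.mean((pred - tgt) ** 2)     # mean-squared reconstruction error

# Gradient of the loss w.r.t. W for one plain gradient step.
grad_W = 2.0 * src.T @ (pred - tgt) / pred.size
W -= 0.1 * grad_W
print(f"reconstruction loss: {loss:.4f}")
```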
arXiv Detail & Related papers (2023-05-16T03:53:30Z)
- Graph Neural Networks for Multiparallel Word Alignment [0.27998963147546146]
We compute high-quality word alignments between multiple language pairs by considering all language pairs together.
We use graph neural networks (GNNs) to exploit the graph structure.
Our method outperforms previous work on three word-alignment datasets and on a downstream task.
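For illustration, a single GCN-style message-passing layer over an alignment graph can be written as below; the toy adjacency matrix, symmetric normalization, and dot-product edge scoring are generic choices, not the architecture of the cited paper.

```python
# Generic single GCN-style layer over an alignment graph; illustrative only,
# not the architecture used in the cited paper.
import numpy as np

rng = np.random.default_rng(1)
n_nodes, d_in, d_out = 4, 6, 3

A = np.array([[0, 1, 1, 0],           # toy adjacency: edges from initial alignments
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = rng.normal(size=(n_nodes, d_in))  # node features (e.g. word embeddings)
W = rng.normal(size=(d_in, d_out))    # layer weights

A_hat = A + np.eye(n_nodes)           # add self-loops
deg = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
H = np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)  # ReLU(normalized A · X · W)

# Edge score for a candidate alignment via a dot product of node states.
score_0_3 = H[0] @ H[3]
print(H.shape, f"score(0,3) = {score_0_3:.3f}")
```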
arXiv Detail & Related papers (2022-03-16T14:41:35Z)
- Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment [49.45399359826453]
Cross-lingual language models are typically pretrained with language modeling on multilingual text or parallel sentences.
We introduce denoising word alignment as a new cross-lingual pre-training task.
Experimental results show that our method improves cross-lingual transferability on various datasets.
arXiv Detail & Related papers (2021-06-11T13:36:01Z)
- Word Alignment by Fine-tuning Embeddings on Parallel Corpora [96.28608163701055]
Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs.
Recently, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data.
In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs and fine-tuning them on parallel text with objectives designed to improve alignment quality.
arXiv Detail & Related papers (2021-01-20T17:54:47Z)
- Do Explicit Alignments Robustly Improve Multilingual Encoders? [22.954688396858085]
Multilingual encoders can effectively learn cross-lingual representations.
Explicit alignment objectives based on bitexts like Europarl or MultiUN have been shown to further improve these representations.
We propose a new contrastive alignment objective that can better utilize such signals.
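As a rough sketch, a contrastive alignment objective over aligned word pairs might take an InfoNCE-like form such as the following (an assumed formulation for illustration, not the objective proposed in that paper):

```python
# Illustrative InfoNCE-style contrastive loss over aligned word pairs;
# this is an assumed formulation, not the objective from the cited paper.
import numpy as np

def info_nce(src, tgt, temperature=0.1):
    """src[i] and tgt[i] are representations of an aligned word pair."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    logits = (src @ tgt.T) / temperature           # similarity of all pairs
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # aligned pairs are the positives

rng = np.random.default_rng(2)
src = rng.normal(size=(4, 8))   # e.g. English word vectors
tgt = rng.normal(size=(4, 8))   # their aligned counterparts
print(f"contrastive loss: {info_nce(src, tgt):.4f}")
```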
arXiv Detail & Related papers (2020-10-06T07:43:17Z)
- Bilingual Text Extraction as Reading Comprehension [23.475200800530306]
We propose a method to extract bilingual texts automatically from noisy parallel corpora by framing the problem as token-level span prediction.
To extract a span of the target document that is a translation of a given source sentence (span), we use either QANet or multilingual BERT.
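A minimal sketch of QA-style span prediction over the target document's tokens follows; it is a generic formulation with random toy inputs, whereas the cited system builds this on QANet or multilingual BERT encoders.

```python
# Generic QA-style span prediction head; an assumed illustration, not the
# cited system (which uses QANet or multilingual BERT as the encoder).
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(3)
n_tokens, d = 10, 16
H = rng.normal(size=(n_tokens, d))   # encoder states for the target document
w_start = rng.normal(size=d)         # start-position scoring head
w_end = rng.normal(size=d)           # end-position scoring head

p_start = softmax(H @ w_start)
p_end = softmax(H @ w_end)

# Pick the best (start, end) pair with start <= end.
best = max(((i, j) for i in range(n_tokens) for j in range(i, n_tokens)),
           key=lambda ij: p_start[ij[0]] * p_end[ij[1]])
print("predicted span:", best)
```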
arXiv Detail & Related papers (2020-04-29T23:41:32Z)
- SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings [3.8424737607413153]
We propose word alignment methods that require no parallel data.
The key idea is to leverage multilingual word embeddings, both static and contextualized, for word alignment.
We find that alignments created from embeddings are superior for two language pairs compared to those produced by traditional statistical methods.
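A simplified view of embedding-based alignment is mutual argmax over a cosine-similarity matrix, as sketched below; SimAlign itself provides several matching methods, so treat this only as a schematic analogue.

```python
# Simplified similarity-based word alignment: align token pairs that are each
# other's nearest neighbour in cosine similarity. A sketch in the spirit of
# embedding-based aligners, not SimAlign's exact matching algorithms.
import numpy as np

def mutual_argmax_alignment(src_emb, tgt_emb):
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                  # cosine similarity matrix
    fwd = sim.argmax(axis=1)           # best target token for each source token
    bwd = sim.argmax(axis=0)           # best source token for each target token
    return [(i, j) for i, j in enumerate(fwd) if bwd[j] == i]

rng = np.random.default_rng(4)
src_emb = rng.normal(size=(5, 32))    # source-sentence token embeddings
tgt_emb = rng.normal(size=(6, 32))    # target-sentence token embeddings
print(mutual_argmax_alignment(src_emb, tgt_emb))
```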
arXiv Detail & Related papers (2020-04-18T23:10:36Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
- Multilingual Alignment of Contextual Word Representations [49.42244463346612]
After alignment, BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model.
We introduce a contextual version of word retrieval and show that it correlates well with downstream zero-shot transfer.
These results support contextual alignment as a useful concept for understanding large multilingual pre-trained models.
arXiv Detail & Related papers (2020-02-10T03:27:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.