SimAlign: High Quality Word Alignments without Parallel Training Data
using Static and Contextualized Embeddings
- URL: http://arxiv.org/abs/2004.08728v4
- Date: Fri, 16 Apr 2021 10:18:06 GMT
- Title: SimAlign: High Quality Word Alignments without Parallel Training Data
using Static and Contextualized Embeddings
- Authors: Masoud Jalili Sabet, Philipp Dufter, François Yvon, Hinrich
Schütze
- Abstract summary: We propose word alignment methods that require no parallel data.
The key idea is to leverage multilingual word embeddings, both static and contextualized, for word alignment.
We find that alignments created from embeddings are superior for four language pairs and comparable for two, relative to those produced by traditional statistical aligners.
- Score: 3.8424737607413153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Word alignments are useful for tasks like statistical and neural machine
translation (NMT) and cross-lingual annotation projection. Statistical word
aligners perform well, as do methods that extract alignments jointly with
translations in NMT. However, most approaches require parallel training data,
and quality decreases as less training data is available. We propose word
alignment methods that require no parallel data. The key idea is to leverage
multilingual word embeddings, both static and contextualized, for word
alignment. Our multilingual embeddings are created from monolingual data only
without relying on any parallel data or dictionaries. We find that alignments
created from embeddings are superior for four and comparable for two language
pairs compared to those produced by traditional statistical aligners, even with
abundant parallel data; e.g., contextualized embeddings achieve a word
alignment F1 for English-German that is 5 percentage points higher than
eflomal, a high-quality statistical aligner, trained on 100k parallel
sentences.
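A minimal sketch of how alignments can be read off contextualized embeddings, in the spirit of the paper's mutual-argmax matching: embed both sentences with a multilingual encoder, build the cosine similarity matrix between word vectors, and keep the links on which both directions agree. The model name (bert-base-multilingual-cased), the choice of hidden layer 8, and the averaging of subword vectors into word vectors are illustrative assumptions here, not the authors' exact configuration or released SimAlign code.

```python
# Hedged sketch: similarity-based word alignment from multilingual
# contextualized embeddings (no parallel data needed). Model, layer, and
# subword pooling are assumptions for illustration, not SimAlign's exact setup.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # assumed multilingual encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()


def embed_words(words, layer=8):
    """Return one vector per word by averaging its subword embeddings."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]  # (num_subwords, dim)
    vectors = []
    for word_idx in range(len(words)):
        # Subword positions belonging to this word (special tokens map to None).
        span = [i for i, w in enumerate(enc.word_ids()) if w == word_idx]
        vectors.append(hidden[span].mean(dim=0))
    return torch.stack(vectors)


def align(src_words, tgt_words):
    """Mutual-argmax extraction over the cosine similarity matrix."""
    src, tgt = embed_words(src_words), embed_words(tgt_words)
    src = src / src.norm(dim=-1, keepdim=True)
    tgt = tgt / tgt.norm(dim=-1, keepdim=True)
    sim = src @ tgt.T  # (len(src_words), len(tgt_words)) cosine similarities
    fwd = sim.argmax(dim=1)  # best target word for each source word
    bwd = sim.argmax(dim=0)  # best source word for each target word
    # Keep (i, j) only if i and j choose each other as nearest neighbors.
    return [(i, j.item()) for i, j in enumerate(fwd) if bwd[j].item() == i]


if __name__ == "__main__":
    print(align("The house is small .".split(),
                "Das Haus ist klein .".split()))
```

Because the extraction relies only on the geometry of the multilingual embedding space, no parallel sentences or dictionaries enter the procedure; softer matchings such as the paper's IterMax or Match methods can be substituted for the mutual-argmax step.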
Related papers
- How Transliterations Improve Crosslingual Alignment [48.929677368744606]
Recent studies have shown that post-aligning multilingual pretrained language models (mPLMs) using alignment objectives can improve crosslingual alignment.
This paper attempts to explicitly evaluate the crosslingual alignment and identify the key elements in transliteration-based approaches that contribute to better performance.
arXiv Detail & Related papers (2024-09-25T20:05:45Z) - WSPAlign: Word Alignment Pre-training via Large-Scale Weakly Supervised
Span Prediction [31.96433679860807]
Most existing word alignment methods rely on manual alignment datasets or parallel corpora.
We relax the requirement for correct, fully-aligned, and parallel sentences.
We then use such a large-scale weakly-supervised dataset for word alignment pre-training via span prediction.
arXiv Detail & Related papers (2023-06-09T03:11:42Z) - OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource
Language Pair for Low-Resource Sentence Retrieval [91.76575626229824]
We present OneAligner, an alignment model specially designed for sentence retrieval tasks.
When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result.
We conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size.
arXiv Detail & Related papers (2022-05-17T19:52:42Z) - Graph Neural Networks for Multiparallel Word Alignment [0.27998963147546146]
We compute high-quality word alignments between multiple language pairs by considering all language pairs together.
We use graph neural networks (GNNs) to exploit the graph structure.
Our method outperforms previous work on three word-alignment datasets and on a downstream task.
arXiv Detail & Related papers (2022-03-16T14:41:35Z) - Constrained Density Matching and Modeling for Cross-lingual Alignment of
Contextualized Representations [27.74320705109685]
We introduce supervised and unsupervised density-based approaches named Real-NVP and GAN-Real-NVP, driven by normalizing flows, to perform alignment.
Our experiments encompass 16 alignments, including our approaches, evaluated across 6 language pairs, synthetic data and 4 NLP tasks.
arXiv Detail & Related papers (2022-01-31T18:41:28Z) - Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking [84.50302759362698]
We enhance the transfer learning process by intermediate fine-tuning of pretrained multilingual models.
We use parallel and conversational movie subtitles datasets to design cross-lingual intermediate tasks.
We achieve impressive improvements (> 20% on goal accuracy) on the parallel MultiWoZ dataset and Multilingual WoZ dataset.
arXiv Detail & Related papers (2021-09-28T11:22:38Z) - Graph Algorithms for Multiparallel Word Alignment [2.5200727733264663]
In this work, we exploit the multiparallelity of corpora by representing an initial set of bilingual alignments as a graph.
We present two graph algorithms for edge prediction: one inspired by recommender systems and one based on network link prediction.
arXiv Detail & Related papers (2021-09-13T19:40:29Z) - Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word
Alignment [49.45399359826453]
Cross-lingual language models are typically pretrained with language modeling on multilingual text or parallel sentences.
We introduce denoising word alignment as a new cross-lingual pre-training task.
Experimental results show that our method improves cross-lingual transferability on various datasets.
arXiv Detail & Related papers (2021-06-11T13:36:01Z) - Word Alignment by Fine-tuning Embeddings on Parallel Corpora [96.28608163701055]
Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs.
Recently, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data.
In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs but fine-tuning them on parallel text with objectives designed to improve alignment quality, and proposing methods to effectively extract alignments from these fine-tuned models.
arXiv Detail & Related papers (2021-01-20T17:54:47Z) - On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.