SimAlign: High Quality Word Alignments without Parallel Training Data
using Static and Contextualized Embeddings
- URL: http://arxiv.org/abs/2004.08728v4
- Date: Fri, 16 Apr 2021 10:18:06 GMT
- Title: SimAlign: High Quality Word Alignments without Parallel Training Data
using Static and Contextualized Embeddings
- Authors: Masoud Jalili Sabet, Philipp Dufter, François Yvon, Hinrich
Schütze
- Abstract summary: We propose word alignment methods that require no parallel data.
The key idea is to leverage multilingual word embeddings, both static and contextualized, for word alignment.
We find that alignments created from embeddings are superior for four language pairs and comparable for two, relative to those produced by traditional statistical aligners.
- Score: 3.8424737607413153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Word alignments are useful for tasks like statistical and neural machine
translation (NMT) and cross-lingual annotation projection. Statistical word
aligners perform well, as do methods that extract alignments jointly with
translations in NMT. However, most approaches require parallel training data,
and quality decreases as less training data is available. We propose word
alignment methods that require no parallel data. The key idea is to leverage
multilingual word embeddings, both static and contextualized, for word
alignment. Our multilingual embeddings are created from monolingual data only
without relying on any parallel data or dictionaries. We find that alignments
created from embeddings are superior for four and comparable for two language
pairs compared to those produced by traditional statistical aligners, even with
abundant parallel data; e.g., contextualized embeddings achieve a word
alignment F1 for English-German that is 5 percentage points higher than
eflomal, a high-quality statistical aligner, trained on 100k parallel
sentences.
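A minimal sketch of how alignments can be read off contextualized embeddings, in the spirit of the paper's mutual-argmax matching: embed both sentences with a multilingual encoder, build the cosine similarity matrix between word vectors, and keep the links on which both directions agree. The model name (bert-base-multilingual-cased), the choice of hidden layer 8, and the averaging of subword vectors into word vectors are illustrative assumptions here, not the authors' exact configuration or released SimAlign code.

```python
# Hedged sketch: similarity-based word alignment from multilingual
# contextualized embeddings (no parallel data needed). Model, layer, and
# subword pooling are assumptions for illustration, not SimAlign's exact setup.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # assumed multilingual encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()


def embed_words(words, layer=8):
    """Return one vector per word by averaging its subword embeddings."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]  # (num_subwords, dim)
    vectors = []
    for word_idx in range(len(words)):
        # Subword positions belonging to this word (special tokens map to None).
        span = [i for i, w in enumerate(enc.word_ids()) if w == word_idx]
        vectors.append(hidden[span].mean(dim=0))
    return torch.stack(vectors)


def align(src_words, tgt_words):
    """Mutual-argmax extraction over the cosine similarity matrix."""
    src, tgt = embed_words(src_words), embed_words(tgt_words)
    src = src / src.norm(dim=-1, keepdim=True)
    tgt = tgt / tgt.norm(dim=-1, keepdim=True)
    sim = src @ tgt.T  # (len(src_words), len(tgt_words)) cosine similarities
    fwd = sim.argmax(dim=1)  # best target word for each source word
    bwd = sim.argmax(dim=0)  # best source word for each target word
    # Keep (i, j) only if i and j choose each other as nearest neighbors.
    return [(i, j.item()) for i, j in enumerate(fwd) if bwd[j].item() == i]


if __name__ == "__main__":
    print(align("The house is small .".split(),
                "Das Haus ist klein .".split()))
```

Because the extraction relies only on the geometry of the multilingual embedding space, no parallel sentences or dictionaries enter the procedure; softer matchings such as the paper's IterMax or Match methods can be substituted for the mutual-argmax step.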
Related papers
- How Transliterations Improve Crosslingual Alignment [48.929677368744606]
Recent studies have shown that post-aligning multilingual pretrained language models (mPLMs) using alignment objectives can improve crosslingual alignment.
This paper attempts to explicitly evaluate the crosslingual alignment and identify the key elements in transliteration-based approaches that contribute to better performance.
arXiv Detail & Related papers (2024-09-25T20:05:45Z) - WSPAlign: Word Alignment Pre-training via Large-Scale Weakly Supervised
Span Prediction [31.96433679860807]
Most existing word alignment methods rely on manual alignment datasets or parallel corpora.
We relax the requirement for correct, fully-aligned, and parallel sentences.
We then use such a large-scale weakly-supervised dataset for word alignment pre-training via span prediction.
arXiv Detail & Related papers (2023-06-09T03:11:42Z) - OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource
Language Pair for Low-Resource Sentence Retrieval [91.76575626229824]
We present OneAligner, an alignment model specially designed for sentence retrieval tasks.
When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result.
We conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size.
arXiv Detail & Related papers (2022-05-17T19:52:42Z) - Graph Neural Networks for Multiparallel Word Alignment [0.27998963147546146]
We compute high-quality word alignments between multiple language pairs by considering all language pairs together.
We use graph neural networks (GNNs) to exploit the graph structure.
Our method outperforms previous work on three word-alignment datasets and on a downstream task.
arXiv Detail & Related papers (2022-03-16T14:41:35Z) - Constrained Density Matching and Modeling for Cross-lingual Alignment of
Contextualized Representations [27.74320705109685]
We introduce supervised and unsupervised density-based approaches named Real-NVP and GAN-Real-NVP, driven by normalizing flows, to perform alignment.
Our experiments encompass 16 alignments, including our approaches, evaluated across 6 language pairs, synthetic data and 4 NLP tasks.
arXiv Detail & Related papers (2022-01-31T18:41:28Z) - Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking [84.50302759362698]
We enhance the transfer learning process by intermediate fine-tuning of pretrained multilingual models.
We use parallel and conversational movie subtitles datasets to design cross-lingual intermediate tasks.
We achieve impressive improvements (> 20% on goal accuracy) on the parallel MultiWoZ dataset and Multilingual WoZ dataset.
arXiv Detail & Related papers (2021-09-28T11:22:38Z) - Graph Algorithms for Multiparallel Word Alignment [2.5200727733264663]
In this work, we exploit the multiparallelity of corpora by representing an initial set of bilingual alignments as a graph.
We present two graph algorithms for edge prediction: one inspired by recommender systems and one based on network link prediction.
arXiv Detail & Related papers (2021-09-13T19:40:29Z) - Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word
Alignment [49.45399359826453]
Cross-lingual language models are typically pretrained with language modeling on multilingual text or parallel sentences.
We introduce denoising word alignment as a new cross-lingual pre-training task.
Experimental results show that our method improves cross-lingual transferability on various datasets.
arXiv Detail & Related papers (2021-06-11T13:36:01Z) - Word Alignment by Fine-tuning Embeddings on Parallel Corpora [96.28608163701055]
Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs.
Recently, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data.
In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs but fine-tuning them on parallel text with objectives designed to improve alignment quality, and proposing methods to effectively extract alignments from these fine-tuned models.
arXiv Detail & Related papers (2021-01-20T17:54:47Z) - On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.