Word Alignment by Fine-tuning Embeddings on Parallel Corpora
- URL: http://arxiv.org/abs/2101.08231v2
- Date: Sun, 24 Jan 2021 23:24:00 GMT
- Title: Word Alignment by Fine-tuning Embeddings on Parallel Corpora
- Authors: Zi-Yi Dou, Graham Neubig
- Abstract summary: Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs.
Recently, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data.
In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs but fine-tuning them on parallel text with objectives designed to improve alignment quality, and proposing methods to effectively extract alignments from these fine-tuned models.
- Score: 96.28608163701055
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Word alignment over parallel corpora has a wide variety of applications,
including learning translation lexicons, cross-lingual transfer of language
processing tools, and automatic evaluation or analysis of translation outputs.
The great majority of past work on word alignment has worked by performing
unsupervised learning on parallel texts. Recently, however, other work has
demonstrated that pre-trained contextualized word embeddings derived from
multilingually trained language models (LMs) prove an attractive alternative,
achieving competitive results on the word alignment task even in the absence of
explicit training on parallel data. In this paper, we examine methods to marry
the two approaches: leveraging pre-trained LMs but fine-tuning them on parallel
text with objectives designed to improve alignment quality, and proposing
methods to effectively extract alignments from these fine-tuned models. We
perform experiments on five language pairs and demonstrate that our model can
consistently outperform previous state-of-the-art models of all varieties. In
addition, we demonstrate that we are able to train multilingual word aligners
that can obtain robust performance on different language pairs. Our aligner,
AWESOME (Aligning Word Embedding Spaces of Multilingual Encoders), with
pre-trained models is available at https://github.com/neulab/awesome-align
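The extraction step described in the abstract can be pictured with a short sketch: compute a source-target similarity matrix from contextual embeddings, turn it into alignment probabilities in both directions, and keep the token pairs that are probable under both. This is only an illustration, not the awesome-align implementation: the off-the-shelf mBERT checkpoint, subword-level alignment, and the 1e-3 threshold are assumptions, and awesome-align additionally fine-tunes the encoder on parallel text before extracting alignments.

```python
# Illustrative sketch: extract word alignments from a multilingual encoder.
# Assumptions (not from the paper): mBERT checkpoint, last hidden layer,
# subword-level alignment, probability threshold of 1e-3.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def embed(sentence):
    """Contextual embeddings for the subword tokens, dropping [CLS]/[SEP]."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, dim)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return tokens[1:-1], hidden[1:-1]

def align(src_sent, tgt_sent, threshold=1e-3):
    src_toks, src_vecs = embed(src_sent)
    tgt_toks, tgt_vecs = embed(tgt_sent)
    sim = src_vecs @ tgt_vecs.T                             # (src_len, tgt_len)
    # Alignment probabilities in both directions; keep pairs probable under both.
    p_s2t = torch.softmax(sim, dim=1)
    p_t2s = torch.softmax(sim, dim=0)
    keep = (p_s2t * p_t2s) > threshold
    return [(src_toks[i], tgt_toks[j]) for i, j in keep.nonzero().tolist()]

print(align("Das ist ein kleines Haus .", "This is a small house ."))
```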
Related papers
- VECO 2.0: Cross-lingual Language Model Pre-training with
Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model VECO2.0 based on contrastive learning with multi-granularity alignments.
Specifically, the sequence-to-sequence alignment is induced to maximize the similarity of the parallel pairs and minimize the non-parallel pairs.
Token-to-token alignment is integrated to bridge the gap between synonymous tokens, mined via a thesaurus dictionary, and the other unpaired tokens in a bilingual instance.
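A minimal sketch of the sentence-level (sequence-to-sequence) contrastive idea above, assuming an in-batch InfoNCE-style loss where the i-th source and target sentences are positives and all other pairs in the batch are negatives; the temperature and symmetric formulation are illustrative assumptions rather than VECO 2.0's actual objective.

```python
# Illustrative sketch: in-batch contrastive loss over parallel sentence pairs.
# Assumption: pre-computed sentence embeddings and a temperature of 0.05;
# the real VECO 2.0 objective and encoder details may differ.
import torch
import torch.nn.functional as F

def parallel_contrastive_loss(src_emb, tgt_emb, temperature=0.05):
    """src_emb, tgt_emb: (batch, dim) embeddings of parallel sentences.
    The i-th source and i-th target form a translation pair (positive);
    all other pairs in the batch act as negatives."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.T / temperature           # (batch, batch) similarity matrix
    labels = torch.arange(src.size(0))           # positives lie on the diagonal
    # Symmetric InfoNCE: source-to-target and target-to-source.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

loss = parallel_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```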
arXiv Detail & Related papers (2023-04-17T12:23:41Z)
- Multilingual Sentence Transformer as A Multilingual Word Aligner [15.689680887384847]
We investigate whether the multilingual sentence Transformer LaBSE is a strong multilingual word aligner.
Experiment results on seven language pairs show that our best aligner outperforms previous state-of-the-art models of all varieties.
Our aligner supports different language pairs in a single model, and even achieves a new state of the art on zero-shot language pairs that do not appear in the fine-tuning process.
arXiv Detail & Related papers (2023-01-28T09:28:55Z)
- Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking [84.50302759362698]
We enhance the transfer learning process by intermediate fine-tuning of pretrained multilingual models.
We use parallel and conversational movie subtitles datasets to design cross-lingual intermediate tasks.
We achieve impressive improvements (> 20% on goal accuracy) on the parallel MultiWoZ dataset and Multilingual WoZ dataset.
arXiv Detail & Related papers (2021-09-28T11:22:38Z)
- Graph Algorithms for Multiparallel Word Alignment [2.5200727733264663]
In this work, we exploit the multiparallelity of corpora by representing an initial set of bilingual alignments as a graph.
We present two graph algorithms for edge prediction: one inspired by recommender systems and one based on network link prediction.
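One way to picture the link-prediction variant: put word tokens from a multiparallel sentence set into a graph, add the initial bilingual alignments as edges, and score missing edges with a link-prediction heuristic. The sketch below uses networkx's Jaccard coefficient as a stand-in; the paper's actual recommender-system and link-prediction algorithms are more involved, and the node names are made up for illustration.

```python
# Illustrative sketch: score candidate alignment edges with a link-prediction
# heuristic over a graph of word tokens from a multiparallel sentence.
# Node names and the Jaccard heuristic are assumptions for illustration only.
import networkx as nx

G = nx.Graph()
# Initial bilingual alignments, e.g. from pairwise aligners (en-de, en-fr, de-fr).
G.add_edges_from([
    ("en:house", "de:Haus"),
    ("en:house", "fr:maison"),
    ("de:Haus", "fr:maison"),
    ("en:the", "de:das"),
    ("fr:la", "de:das"),
])
# Score edges that the pairwise aligners missed, e.g. en:the -- fr:la.
candidates = [("en:the", "fr:la"), ("en:house", "de:das")]
for u, v, score in nx.jaccard_coefficient(G, candidates):
    print(f"{u} -- {v}: {score:.2f}")
```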
arXiv Detail & Related papers (2021-09-13T19:40:29Z)
- Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment [49.45399359826453]
Cross-lingual language models are typically pretrained with language modeling on multilingual text or parallel sentences.
We introduce denoising word alignment as a new cross-lingual pre-training task.
Experimental results show that our method improves cross-lingual transferability on various datasets.
arXiv Detail & Related papers (2021-06-11T13:36:01Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
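A minimal sketch of the translate-and-reconstruct idea, assuming a shared LSTM encoder feeding two decoders (one reconstructing the input, one producing the translation) whose hidden states serve as the contextualised embeddings; the layer sizes, shared vocabulary, and absence of attention are simplifying assumptions, not the paper's exact architecture.

```python
# Illustrative sketch: shared LSTM encoder with a reconstruction decoder and a
# translation decoder; the encoder's hidden states can be used as contextualised
# cross-lingual word embeddings. Sizes and vocabulary handling are assumptions.
import torch
import torch.nn as nn

class TranslateAndReconstruct(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.reconstruct_decoder = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.translate_decoder = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, src_ids):
        enc_states, _ = self.encoder(self.embed(src_ids))   # contextual embeddings
        rec_logits = self.out(self.reconstruct_decoder(enc_states)[0])
        trans_logits = self.out(self.translate_decoder(enc_states)[0])
        return enc_states, rec_logits, trans_logits

model = TranslateAndReconstruct(vocab_size=10_000)
enc_states, rec_logits, trans_logits = model(torch.randint(0, 10_000, (2, 7)))
# enc_states are the contextualised word embeddings; the two logits tensors
# would be trained with cross-entropy against the input and its translation.
```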
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Cross-lingual Alignment Methods for Multilingual BERT: A Comparative Study [2.101267270902429]
We analyse how different forms of cross-lingual supervision and various alignment methods influence the transfer capability of mBERT in the zero-shot setting.
We find that supervision from parallel corpus is generally superior to dictionary alignments.
arXiv Detail & Related papers (2020-09-29T20:56:57Z)
- SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings [3.8424737607413153]
We propose word alignment methods that require no parallel data.
The key idea is to leverage multilingual word embeddings, both static and contextualized, for word alignment.
We find that alignments created from embeddings are superior for two language pairs compared to those produced by traditional statistical methods.
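The simplest way to turn such embeddings into alignments is a mutual-best-match rule in the spirit of SimAlign's Argmax heuristic: keep a pair (i, j) only when j is the closest target word to source word i and vice versa. In the sketch below, random matrices stand in for static or contextualized multilingual word embeddings.

```python
# Illustrative sketch: keep (i, j) only if j is the best target for source i AND
# i is the best source for target j (a mutual-argmax rule). The random matrices
# stand in for multilingual static or contextualized word embeddings.
import numpy as np

def mutual_argmax_alignment(src_vecs, tgt_vecs):
    """src_vecs: (m, d), tgt_vecs: (n, d); returns aligned index pairs."""
    # Cosine similarity between every source and target word.
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sim = src @ tgt.T
    best_tgt = sim.argmax(axis=1)                 # best target for each source word
    best_src = sim.argmax(axis=0)                 # best source for each target word
    return [(i, j) for i, j in enumerate(best_tgt) if best_src[j] == i]

print(mutual_argmax_alignment(np.random.randn(5, 300), np.random.randn(6, 300)))
```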
arXiv Detail & Related papers (2020-04-18T23:10:36Z)
- Multilingual Alignment of Contextual Word Representations [49.42244463346612]
After the proposed alignment procedure, BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model.
We introduce a contextual version of word retrieval and show that it correlates well with downstream zero-shot transfer.
These results support contextual alignment as a useful concept for understanding large multilingual pre-trained models.
arXiv Detail & Related papers (2020-02-10T03:27:21Z)