Subword Sampling for Low Resource Word Alignment
- URL: http://arxiv.org/abs/2012.11657v1
- Date: Mon, 21 Dec 2020 19:47:04 GMT
- Title: Subword Sampling for Low Resource Word Alignment
- Authors: Ehsaneddin Asgari and Masoud Jalili Sabet and Philipp Dufter and
Christopher Ringlstetter and Hinrich Schütze
- Abstract summary: We propose subword sampling-based alignment of text units.
We show that the subword sampling method consistently outperforms word-level alignment on six language pairs.
- Score: 4.663577299263155
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Annotation projection is an important area in NLP that can greatly contribute
to creating language resources for low-resource languages. Word alignment plays
a key role in this setting. However, most of the existing word alignment
methods are designed for a high resource setting in machine translation where
millions of parallel sentences are available. This amount reduces to a few
thousand sentences when dealing with low-resource languages, which causes the
established IBM models to fail. In this paper, we propose subword
sampling-based alignment of text units. This method's hypothesis is that the
aggregation of different granularities of text for certain language pairs can
help word-level alignment. For certain languages for which gold-standard
alignments exist, we propose an iterative Bayesian optimization framework to
select subwords from the space of possible subword
representations of the source and target sentences. We show that the subword
sampling method consistently outperforms word-level alignment on six language
pairs: English-German, English-French, English-Romanian, English-Persian,
English-Hindi, and English-Inuktitut. In addition, we show that the
hyperparameters learned for certain language pairs can be applied to other
languages without supervision and consistently improve the alignment results. We
observe that, using $5K$ parallel sentences together with our proposed subword
sampling approach, we obtain F1 scores similar to those achieved with $100K$s of
parallel sentences in the existing word-level fast-align/eflomal alignment methods.
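To illustrate the core aggregation idea, here is a minimal Python sketch (an illustration under assumed inputs, not the authors' released code): it projects subword-level alignment links from several segmentation samples back to word indices and keeps the word pairs supported by enough samples. The segmentations and links in the demo are invented and stand in for sampled subword segmentations aligned by an external tool such as fast-align or eflomal.

```python
from collections import Counter

def owners(segmented_words):
    """Map each subword position to the index of the word it belongs to.
    `segmented_words` is a list of lists, one inner list of pieces per word."""
    out = []
    for word_idx, pieces in enumerate(segmented_words):
        out.extend([word_idx] * len(pieces))
    return out

def aggregate(samples, min_votes=2):
    """Combine subword-level alignments from several segmentation samples.

    `samples` is a list of (src_segmentation, tgt_segmentation, links), where
    `links` are (src_subword_pos, tgt_subword_pos) pairs produced by an
    external subword-level aligner. Returns the word-level alignment edges
    supported by at least `min_votes` samples."""
    votes = Counter()
    for src_seg, tgt_seg, links in samples:
        src_own, tgt_own = owners(src_seg), owners(tgt_seg)
        # Project every subword link to a word pair; one vote per sample.
        votes.update({(src_own[i], tgt_own[j]) for i, j in links})
    return sorted(edge for edge, v in votes.items() if v >= min_votes)

if __name__ == "__main__":
    # Two hypothetical segmentation samples for "we visited berlin" /
    # "wir besuchten berlin", with subword links as an aligner might emit them.
    samples = [
        ([["we"], ["visit", "ed"], ["berlin"]],
         [["wir"], ["besuch", "ten"], ["berlin"]],
         [(0, 0), (1, 1), (2, 2), (3, 3)]),
        ([["we"], ["vi", "sit", "ed"], ["ber", "lin"]],
         [["wir"], ["be", "such", "ten"], ["ber", "lin"]],
         [(0, 0), (1, 2), (2, 1), (3, 3), (4, 4), (5, 5)]),
    ]
    print(aggregate(samples))  # -> [(0, 0), (1, 1), (2, 2)]
```

In the full method, which granularities to sample and combine is itself tuned with Bayesian optimization on the language pairs that have gold alignments; the sketch simply uses a fixed vote threshold.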
Related papers
- Multilingual Sentence Transformer as A Multilingual Word Aligner [15.689680887384847]
We investigate whether the multilingual sentence Transformer LaBSE is a strong multilingual word aligner.
Experiment results on seven language pairs show that our best aligner outperforms previous state-of-the-art models of all varieties.
Our aligner supports different language pairs in a single model, and even achieves new state-of-the-art on zero-shot language pairs that do not appear in the fine-tuning process.
arXiv Detail & Related papers (2023-01-28T09:28:55Z) - DICTDIS: Dictionary Constrained Disambiguation for Improved NMT [50.888881348723295]
We present DictDis, a lexically constrained NMT system that disambiguates between multiple candidate translations derived from dictionaries.
We demonstrate the utility of DictDis via extensive experiments on English-Hindi and English-German sentences in a variety of domains, including regulatory, finance, and engineering.
arXiv Detail & Related papers (2022-10-13T13:04:16Z) - Using Optimal Transport as Alignment Objective for fine-tuning
Multilingual Contextualized Embeddings [7.026476782041066]
We propose using Optimal Transport (OT) as an alignment objective during fine-tuning to improve multilingual contextualized representations.
This approach does not require word-alignment pairs prior to fine-tuning and instead learns the word alignments within context in an unsupervised manner (a toy Sinkhorn-style sketch of OT-based soft alignment appears after this list).
arXiv Detail & Related papers (2021-10-06T16:13:45Z) - Word Alignment by Fine-tuning Embeddings on Parallel Corpora [96.28608163701055]
Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs.
Recently, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data.
In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs but fine-tuning them on parallel text with objectives designed to improve alignment quality.
arXiv Detail & Related papers (2021-01-20T17:54:47Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - Anchor-based Bilingual Word Embeddings for Low-Resource Languages [76.48625630211943]
Good quality monolingual word embeddings (MWEs) can be built for languages which have large amounts of unlabeled text.
MWEs can be aligned to bilingual spaces using only a few thousand word translation pairs.
This paper proposes a new approach for building bilingual word embeddings (BWEs) in which the vector space of the high-resource source language is used as a starting point.
arXiv Detail & Related papers (2020-10-23T19:17:00Z) - A Supervised Word Alignment Method based on Cross-Language Span
Prediction using Multilingual BERT [22.701728185474195]
We first formalize a word alignment problem as a collection of independent predictions from a token in the source sentence to a span in the target sentence.
We then solve this problem by using multilingual BERT, which is fine-tuned on manually created gold word alignment data.
We show that the proposed method significantly outperformed previous supervised and unsupervised word alignment methods without using any bitexts for pretraining.
arXiv Detail & Related papers (2020-04-29T23:40:08Z) - SimAlign: High Quality Word Alignments without Parallel Training Data
using Static and Contextualized Embeddings [3.8424737607413153]
We propose word alignment methods that require no parallel data.
The key idea is to leverage multilingual word embeddings, both static and contextualized, for word alignment.
We find that alignments created from embeddings are superior for two language pairs compared to those produced by traditional statistical methods (a toy similarity-matrix sketch of this idea appears after this list).
arXiv Detail & Related papers (2020-04-18T23:10:36Z) - Multilingual Alignment of Contextual Word Representations [49.42244463346612]
After alignment, BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model.
We introduce a contextual version of word retrieval and show that it correlates well with downstream zero-shot transfer.
These results support contextual alignment as a useful concept for understanding large multilingual pre-trained models.
arXiv Detail & Related papers (2020-02-10T03:27:21Z) - On the Importance of Word Order Information in Cross-lingual Sequence
Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language might fail to handle target languages whose word order differs.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
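For the Optimal Transport entry above, the following is a toy numpy sketch of how OT can produce a soft word-alignment matrix via Sinkhorn iterations over a cosine-distance cost. It illustrates the general OT idea only, not that paper's fine-tuning objective, and the random vectors are stand-ins for contextualized embeddings.

```python
import numpy as np

def sinkhorn_alignment(src_vecs, tgt_vecs, reg=0.1, n_iter=100):
    """Return an (m, n) transport plan that acts as a soft alignment matrix."""
    s = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    t = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    cost = 1.0 - s @ t.T                      # cosine distance as transport cost
    K = np.exp(-cost / reg)                   # Gibbs kernel
    a = np.full(len(s), 1.0 / len(s))         # uniform source marginal
    b = np.full(len(t), 1.0 / len(t))         # uniform target marginal
    u = np.ones_like(a)
    for _ in range(n_iter):                   # Sinkhorn-Knopp scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    src = rng.normal(size=(3, 16))                           # toy "source word" vectors
    tgt = src[[1, 2, 0]] + 0.05 * rng.normal(size=(3, 16))   # permuted, noisy "target" vectors
    plan = sinkhorn_alignment(src, tgt)
    print(plan.argmax(axis=1))                # expected to recover the permutation: [2 0 1]
```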
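And for the SimAlign entry, a toy sketch of the embedding-similarity idea (in the spirit of argmax-style inference, again with random vectors standing in for real static or contextualized multilingual embeddings): build a cosine similarity matrix between source and target word vectors and keep the pairs that are mutual argmaxes in both directions.

```python
import numpy as np

def mutual_argmax_alignment(src_vecs, tgt_vecs):
    """src_vecs: (m, d), tgt_vecs: (n, d) -> list of aligned (i, j) pairs."""
    s = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    t = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sim = s @ t.T                              # (m, n) cosine similarities
    fwd = sim.argmax(axis=1)                   # best target for each source word
    bwd = sim.argmax(axis=0)                   # best source for each target word
    return [(i, j) for i, j in enumerate(fwd) if bwd[j] == i]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.normal(size=(4, 16))                            # toy source word vectors
    tgt = src[[2, 0, 1, 3]] + 0.1 * rng.normal(size=(4, 16))  # permuted, noisy target vectors
    print(mutual_argmax_alignment(src, tgt))   # expect [(0, 1), (1, 2), (2, 0), (3, 3)]
```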
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.