Don't Forget Cheap Training Signals Before Building Unsupervised
Bilingual Word Embeddings
- URL: http://arxiv.org/abs/2205.15713v1
- Date: Tue, 31 May 2022 12:00:55 GMT
- Title: Don't Forget Cheap Training Signals Before Building Unsupervised
Bilingual Word Embeddings
- Authors: Silvia Severini, Viktor Hangya, Masoud Jalili Sabet, Alexander Fraser,
Hinrich Schütze
- Abstract summary: We argue that easy-to-access cross-lingual signals should always be considered when developing unsupervised BWE methods.
We show that such cheap signals work well and that they outperform more complex unsupervised methods on distant language pairs.
Our results show that these training signals should not be neglected when building BWEs, even for distant languages.
- Score: 64.06041300946517
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bilingual Word Embeddings (BWEs) are one of the cornerstones of cross-lingual
transfer of NLP models. They can be built using only monolingual corpora
without supervision, leading to numerous works focusing on unsupervised BWEs.
However, most of the current approaches to build unsupervised BWEs do not
compare their results with methods based on easy-to-access cross-lingual
signals. In this paper, we argue that such signals should always be considered
when developing unsupervised BWE methods. The two approaches we find most
effective are: 1) using identical words as seed lexicons (which unsupervised
approaches incorrectly assume are not available for orthographically distinct
language pairs) and 2) combining such lexicons with pairs extracted by matching
romanized versions of words with an edit distance threshold. We experiment on
thirteen non-Latin languages (and English) and show that such cheap signals
work well and that they outperform more complex unsupervised methods on
distant language pairs such as Chinese, Japanese, Kannada, Tamil, and Thai. In
addition, they are even competitive with the use of high-quality lexicons in
supervised approaches. Our results show that these training signals should not
be neglected when building BWEs, even for distant languages.
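To make the two cheap signals concrete, the sketch below builds a seed lexicon from two toy vocabularies: identical strings are paired directly, and remaining words are paired when their romanizations fall within a small edit-distance threshold. The `unidecode` package as the romanizer, the threshold of 1, and the brute-force matching loop are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the two cheap signals: identical-word pairs plus
# romanized near-matches under an edit-distance threshold. The `unidecode`
# romanization, the threshold of 1, and the brute-force O(|V|^2) loop are
# illustrative assumptions, not the authors' implementation.
from unidecode import unidecode  # pip install unidecode

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cheap_seed_lexicon(src_vocab, tgt_vocab, max_dist=1):
    """Signal 1: identical strings; signal 2: romanized near-matches."""
    pairs = {(w, w) for w in src_vocab & tgt_vocab}
    rom_tgt = {t: unidecode(t) for t in tgt_vocab}
    for s in src_vocab - tgt_vocab:
        rs = unidecode(s)
        for t, rt in rom_tgt.items():
            # Length pruning keeps the toy quadratic scan cheap.
            if abs(len(rs) - len(rt)) <= max_dist and levenshtein(rs, rt) <= max_dist:
                pairs.add((s, t))
    return sorted(pairs)

src_vocab = {"internet", "tokyo", "praha"}   # toy source vocabulary
tgt_vocab = {"internet", "tōkyō", "praha"}   # toy target vocabulary
print(cheap_seed_lexicon(src_vocab, tgt_vocab))
# [('internet', 'internet'), ('praha', 'praha'), ('tokyo', 'tōkyō')]
```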
Related papers
- Unsupervised Bilingual Lexicon Induction for Low Resource Languages [0.9653538131757154]
We use the unsupervised version of VecMap, a commonly used structure-based UBLI framework.
We carry out a comprehensive set of experiments using the LRL pairs English-Sinhala, English-Tamil, and English-Punjabi.
These experiments helped us to identify the best combination of the extensions.
arXiv Detail & Related papers (2024-12-22T07:07:09Z)
- VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning [56.47303426167584]
We propose VECO 2.0, a cross-lingual pre-trained model based on contrastive learning with multi-granularity alignments.
Specifically, the sequence-to-sequence alignment is induced to maximize the similarity of the parallel pairs and minimize the non-parallel pairs.
Token-to-token alignment is integrated to separate synonymous tokens, excavated via a thesaurus dictionary, from the other unpaired tokens in a bilingual instance.
arXiv Detail & Related papers (2023-04-17T12:23:41Z)
- Multilingual Sentence Transformer as A Multilingual Word Aligner [15.689680887384847]
We investigate whether the multilingual sentence Transformer LaBSE is a strong multilingual word aligner.
Experiment results on seven language pairs show that our best aligner outperforms previous state-of-the-art models of all varieties.
Our aligner supports different language pairs in a single model, and even achieves a new state of the art on zero-shot language pairs that do not appear in the finetuning process (see the alignment sketch after this list).
arXiv Detail & Related papers (2023-01-28T09:28:55Z)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has shown that translation quality suffers when these representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
- Bilingual Lexicon Induction via Unsupervised Bitext Construction and Word Alignment [49.3253280592705]
We show it is possible to produce much higher quality lexicons with methods that combine bitext mining and unsupervised word alignment.
Our final model outperforms the state of the art on the BUCC 2020 shared task by 14 F1 points averaged over 12 language pairs.
arXiv Detail & Related papers (2021-01-01T03:12:42Z)
- Globetrotter: Unsupervised Multilingual Translation from Visual Alignment [24.44204156935044]
We introduce a framework that uses the visual modality to align multiple languages.
We estimate the cross-modal alignment between language and images, and use this estimate to guide the learning of cross-lingual representations.
Our language representations are trained jointly in one model with a single stage.
arXiv Detail & Related papers (2020-12-08T18:50:40Z)
- Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
However, a UNMT model can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time.
In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder.
arXiv Detail & Related papers (2020-04-21T17:26:16Z)
- Cross-lingual Supervision Improves Unsupervised Neural Machine Translation [97.84871088440102]
We introduce a multilingual unsupervised NMT framework to leverage weakly supervised signals from high-resource language pairs to zero-resource translation directions.
Our method significantly improves translation quality by more than 3 BLEU points on six benchmark unsupervised translation directions.
arXiv Detail & Related papers (2020-04-07T05:46:49Z)
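Several entries above (the LaBSE word aligner and the bitext-plus-alignment lexicon work) rest on the same primitive: aligning tokens by embedding similarity. The sketch below shows one common heuristic, bidirectional argmax intersection over a cosine-similarity matrix; the random vectors are stand-ins for contextual embeddings from a multilingual encoder such as LaBSE, and the heuristic is an assumption for illustration, not any single paper's exact procedure.

```python
# A hedged sketch of similarity-based word alignment: the random vectors
# stand in for contextual token embeddings from a multilingual encoder
# such as LaBSE; real use would encode both sentences with one model.
import numpy as np

def align(src_emb: np.ndarray, tgt_emb: np.ndarray):
    """Return (i, j) token pairs that are mutual nearest neighbors
    under cosine similarity (bidirectional argmax intersection)."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T            # cosine-similarity matrix
    fwd = sim.argmax(axis=1)     # best target index for each source token
    bwd = sim.argmax(axis=0)     # best source index for each target token
    return [(i, int(j)) for i, j in enumerate(fwd) if bwd[j] == i]

rng = np.random.default_rng(0)
src_emb = rng.normal(size=(4, 8))  # 4 source tokens, 8-dim toy embeddings
tgt_emb = rng.normal(size=(5, 8))  # 5 target tokens
print(align(src_emb, tgt_emb))     # list of mutually-best (src, tgt) pairs
```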
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.