Unsupervised Bilingual Lexicon Induction for Low Resource Languages
- URL: http://arxiv.org/abs/2412.16894v1
- Date: Sun, 22 Dec 2024 07:07:09 GMT
- Title: Unsupervised Bilingual Lexicon Induction for Low Resource Languages
- Authors: Charitha Rathnayake, P. R. S. Thilakarathna, Uthpala Nethmini, Rishemjith Kaur, Surangika Ranathunga
- Abstract summary: We use the unsupervised version of VecMap, a commonly used structure-based UBLI framework.
We carry out a comprehensive set of experiments using the LRL pairs English-Sinhala, English-Tamil, and English-Punjabi.
These experiments helped us identify the best combination of the extensions.
- Score: 0.9653538131757154
- Abstract: Bilingual lexicons play a crucial role in various Natural Language Processing tasks. However, many low-resource languages (LRLs) do not have such lexicons, and for the same reason cannot benefit from supervised Bilingual Lexicon Induction (BLI) techniques. To address this, unsupervised BLI (UBLI) techniques were introduced. A prominent technique in this line is structure-based UBLI. It is an iterative method in which a seed lexicon, initially learned from monolingual embeddings, is progressively improved. There have been numerous improvements to this core idea; however, they have been experimented with independently of each other. In this paper, we investigate whether using these techniques simultaneously would lead to equal gains. We use the unsupervised version of VecMap, a commonly used structure-based UBLI framework, and carry out a comprehensive set of experiments using the LRL pairs English-Sinhala, English-Tamil, and English-Punjabi. These experiments helped us identify the best combination of the extensions. We also release bilingual dictionaries for English-Sinhala and English-Punjabi.
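To make the structure-based UBLI idea concrete, here is a minimal sketch of the VecMap-style self-learning loop: normalize the monolingual embeddings, fit an orthogonal mapping to the current dictionary via Procrustes, then re-induce the dictionary from the mapped space. This is an illustration only; VecMap's actual unsupervised initialization, CSLS retrieval, and stochastic dictionary induction are omitted, and `X`, `Z`, and `seed_pairs` are assumed inputs.

```python
import numpy as np

def normalize(X):
    # Length-normalize then mean-center, as is common in VecMap-style pipelines.
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    return X - X.mean(axis=0, keepdims=True)

def induce_dictionary(XW, Z):
    # Nearest-neighbour induction: each mapped source word picks its closest
    # target word by similarity (plain NN here; VecMap uses CSLS instead).
    sims = XW @ Z.T
    return sims.argmax(axis=1)

def self_learning(X, Z, seed_pairs, iters=10):
    """Iteratively refine an orthogonal mapping W and the induced lexicon.

    X, Z      : (n_src, d) and (n_tgt, d) monolingual embedding matrices
    seed_pairs: initial (src_idx, tgt_idx) pairs, e.g. from an unsupervised
                similarity-distribution initialization
    """
    X, Z = normalize(X), normalize(Z)
    src, tgt = zip(*seed_pairs)
    for _ in range(iters):
        # Orthogonal Procrustes: the best rotation aligning the current pairs.
        U, _, Vt = np.linalg.svd(X[list(src)].T @ Z[list(tgt)])
        W = U @ Vt
        # Re-induce the full dictionary from the mapped source space.
        tgt = induce_dictionary(X @ W, Z)
        src = range(len(X))
    return W, list(zip(src, tgt))
```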
Related papers
- Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models [52.00446751692225]
We present a novel and simple yet effective method called Dictionary Insertion Prompting (DIP).
When providing a non-English prompt, DIP looks up a word dictionary and inserts words' English counterparts into the prompt for LLMs.
This enables better translation into English and better English reasoning steps, which leads to noticeably better results.
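As a rough illustration of the idea, the sketch below annotates dictionary-known words in a non-English prompt with their English counterparts; the toy dictionary and annotation format are assumptions, not the paper's actual prompt template.

```python
def dictionary_insertion_prompt(prompt, bilingual_dict):
    """Annotate each dictionary-known word in the prompt with its English
    counterpart -- a sketch of the DIP idea, not the paper's exact template."""
    annotated = []
    for token in prompt.split():
        english = bilingual_dict.get(token.lower())
        annotated.append(f"{token} ({english})" if english else token)
    return " ".join(annotated)

# Illustrative toy dictionary; a real one would come from a bilingual lexicon.
toy_dict = {"gato": "cat", "negro": "black"}
print(dictionary_insertion_prompt("El gato negro duerme", toy_dict))
# -> "El gato (cat) negro (black) duerme"
```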
arXiv Detail & Related papers (2024-11-02T05:10:50Z) - Self-Augmented In-Context Learning for Unsupervised Word Translation [23.495503962839337]
Large language models (LLMs) demonstrate strong word translation or bilingual lexicon induction (BLI) capabilities in few-shot setups.
We propose self-augmented in-context learning (SAIL) for unsupervised BLI.
Our method shows substantial gains over zero-shot prompting of LLMs on two established BLI benchmarks.
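The high-level loop behind SAIL can be sketched as follows, with `llm_translate` as a hypothetical helper standing in for zero-/few-shot LLM prompting; the confidence estimation and filtering details below are assumptions.

```python
def sail(source_words, llm_translate, rounds=3, keep_top=100):
    """Self-augmented in-context learning, sketched: start from zero-shot
    LLM translations, keep the highest-confidence pairs, and reuse them as
    in-context examples in the next round. `llm_translate(word, examples)`
    is a hypothetical helper returning (translation, confidence)."""
    examples = []  # (source, target) pairs used as in-context demonstrations
    for _ in range(rounds):
        scored = []
        for word in source_words:
            translation, confidence = llm_translate(word, examples)
            scored.append((confidence, word, translation))
        scored.sort(reverse=True)
        examples = [(w, t) for _, w, t in scored[:keep_top]]
    return examples
```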
arXiv Detail & Related papers (2024-02-15T15:43:05Z) - On Bilingual Lexicon Induction with Large Language Models [81.6546357879259]
We examine the potential of the latest generation of Large Language Models for the development of bilingual lexicons.
We study 1) zero-shot prompting for unsupervised BLI and 2) few-shot in-context prompting with a set of seed translation pairs.
Our work is the first to demonstrate strong BLI capabilities of text-to-text mLLMs.
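A few-shot in-context prompt of the kind studied here can be built along these lines; the exact wording is an assumption, since the paper evaluates several templates.

```python
def few_shot_bli_prompt(seed_pairs, query_word,
                        src_lang="German", tgt_lang="English"):
    """Build a few-shot BLI prompt from seed translation pairs.
    The template wording is illustrative, not the paper's exact prompt."""
    lines = [f"Translate the following {src_lang} words into {tgt_lang}."]
    for src, tgt in seed_pairs:
        lines.append(f"{src_lang}: {src} -> {tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {query_word} -> {tgt_lang}:")
    return "\n".join(lines)

print(few_shot_bli_prompt([("Hund", "dog"), ("Katze", "cat")], "Vogel"))
```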
arXiv Detail & Related papers (2023-10-21T12:43:27Z) - When your Cousin has the Right Connections: Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages [29.346191691508125]
Unsupervised bilingual lexicon induction is most likely to be useful for low-resource languages, where large datasets are not available.
We show that state-of-the-art BLI methods in the literature exhibit near-zero performance for severely data-imbalanced language pairs.
We present a new method for unsupervised BLI between a related LRL and HRL that only requires inference on a masked language model of the HRL.
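As a loose, simplified stand-in for inference on a masked language model (not the paper's actual procedure), one can rank candidate words by how plausible an MLM finds them in a template context, e.g. with the Hugging Face fill-mask pipeline; the model, template, and candidates below are purely illustrative.

```python
from transformers import pipeline

# Rank candidate words by masked-LM plausibility in a template context.
fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

def score_candidates(template, candidates):
    # The template must contain the model's mask token, e.g. "[MASK]".
    results = fill(template, targets=candidates)
    return {r["token_str"]: r["score"] for r in results}

print(score_candidates(f"A {fill.tokenizer.mask_token} sat on the mat.",
                       ["cat", "dog", "car"]))
```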
arXiv Detail & Related papers (2023-05-23T12:49:21Z) - Improving Bilingual Lexicon Induction with Cross-Encoder Reranking [31.142790337451366]
We propose a novel semi-supervised post-hoc reranking method termed BLICEr (BLI with Cross-Encoder Reranking).
The key idea is to 'extract' cross-lingual lexical knowledge from mPLMs, and then combine it with the original CLWEs.
BLICEr establishes new results on two standard BLI benchmarks spanning a wide spectrum of diverse languages.
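The core reranking step can be sketched as below: score (source word, candidate) pairs with a cross-encoder and linearly interpolate with the original CLWE similarities. The off-the-shelf English model and the mixing weight are assumptions for illustration; BLICEr fine-tunes its own cross-encoder on knowledge extracted from mPLMs.

```python
from sentence_transformers import CrossEncoder

def rerank(source_word, candidates, clwe_scores, weight=0.5,
           model_name="cross-encoder/stsb-roberta-base"):
    """Rerank CLWE-retrieved candidates with a cross-encoder, then linearly
    combine the two scores -- the reranking idea only, not BLICEr's actual
    fine-tuned cross-encoder or prompt template."""
    ce = CrossEncoder(model_name)
    ce_scores = ce.predict([(source_word, c) for c in candidates])
    combined = [weight * ce_s + (1 - weight) * clwe_s
                for ce_s, clwe_s in zip(ce_scores, clwe_scores)]
    return sorted(zip(candidates, combined), key=lambda x: -x[1])
```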
arXiv Detail & Related papers (2022-10-30T21:26:07Z) - Don't Forget Cheap Training Signals Before Building Unsupervised Bilingual Word Embeddings [64.06041300946517]
We argue that easy-to-access cross-lingual signals should always be considered when developing unsupervised BWE methods.
We show that such cheap signals work well and that they outperform using more complex unsupervised methods on distant language pairs.
Our results show that these training signals should not be neglected when building BWEs, even for distant languages.
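One of the cheapest such signals is identical strings shared by the two vocabularies (numerals, names, loanwords), which already yields a usable seed dictionary. A minimal sketch with toy vocabularies; the paper studies several such signals:

```python
def identical_string_seed(src_vocab, tgt_vocab):
    """Build a seed dictionary from the cheapest cross-lingual signal:
    strings that occur in both vocabularies."""
    shared = set(src_vocab) & set(tgt_vocab)
    # Digits and other symbols shared across scripts are also cheap anchors.
    return sorted((w, w) for w in shared)

print(identical_string_seed(["berlin", "hund", "2024"],
                            ["berlin", "dog", "2024"]))
# -> [('2024', '2024'), ('berlin', 'berlin')]
```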
arXiv Detail & Related papers (2022-05-31T12:00:55Z) - Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual sentence encoders for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z) - Bilingual Lexicon Induction via Unsupervised Bitext Construction and Word Alignment [49.3253280592705]
We show it is possible to produce much higher quality lexicons with methods that combine bitext mining and unsupervised word alignment.
Our final model outperforms the state of the art on the BUCC 2020 shared task by 14 $F_1$ points averaged over 12 language pairs.
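A heavily simplified version of this pipeline is sketched below: mine sentence pairs by embedding similarity, then extract word translations from the mined pairs. The `embed` function is an assumed input, and Dice-coefficient co-occurrence scoring is a crude substitute for the margin-based mining and unsupervised word alignment the paper actually uses.

```python
import numpy as np
from collections import Counter
from itertools import product

def mine_bitext(src_sents, tgt_sents, embed, threshold=0.8):
    """Pair each source sentence with its most similar target sentence by
    cosine similarity. `embed` is an assumed function mapping a list of
    sentences to unit-length vectors."""
    S, T = embed(src_sents), embed(tgt_sents)
    sims = S @ T.T
    pairs = []
    for i, j in enumerate(sims.argmax(axis=1)):
        if sims[i, j] >= threshold:
            pairs.append((src_sents[i], tgt_sents[j]))
    return pairs

def extract_lexicon(bitext, min_count=2):
    """Score word pairs by the Dice coefficient over mined sentence pairs;
    a crude substitute for unsupervised word alignment."""
    src_cnt, tgt_cnt, co_cnt = Counter(), Counter(), Counter()
    for src, tgt in bitext:
        s_toks, t_toks = set(src.split()), set(tgt.split())
        src_cnt.update(s_toks)
        tgt_cnt.update(t_toks)
        co_cnt.update(product(s_toks, t_toks))
    return {(s, t): 2 * c / (src_cnt[s] + tgt_cnt[t])
            for (s, t), c in co_cnt.items() if c >= min_count}
```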
arXiv Detail & Related papers (2021-01-01T03:12:42Z) - A Relaxed Matching Procedure for Unsupervised BLI [19.99658962367335]
We propose a relaxed matching procedure to find a more precise matching between two languages.
We also find that aligning source and target language embedding space bidirectionally will bring significant improvement.
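The value of bidirectional alignment can be illustrated with a much simpler procedure than the paper's relaxed matching: keep only mutual nearest neighbours, i.e. pairs that pick each other in both directions. A sketch, assuming the rows of `X` and `Z` are unit-normalized:

```python
import numpy as np

def mutual_nn_matches(X, Z):
    """Keep only pairs that are nearest neighbours in *both* directions --
    a simple illustration of bidirectional matching; the paper's relaxed
    matching procedure is more sophisticated than this."""
    sims = X @ Z.T
    fwd = sims.argmax(axis=1)  # source -> target
    bwd = sims.argmax(axis=0)  # target -> source
    return [(i, j) for i, j in enumerate(fwd) if bwd[j] == i]
```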
arXiv Detail & Related papers (2020-10-14T13:53:08Z) - Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
However, existing UNMT systems can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time.
In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder.
arXiv Detail & Related papers (2020-04-21T17:26:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.