ProMap: Effective Bilingual Lexicon Induction via Language Model Prompting
- URL: http://arxiv.org/abs/2310.18778v1
- Date: Sat, 28 Oct 2023 18:33:24 GMT
- Title: ProMap: Effective Bilingual Lexicon Induction via Language Model Prompting
- Authors: Abdellah El Mekki, Muhammad Abdul-Mageed, ElMoatez Billah Nagoudi,
Ismail Berrada and Ahmed Khoumsi
- Abstract summary: We introduce ProMap, a novel approach for bilingual lexicon induction (BLI).
ProMap relies on effective padded prompting of language models with a seed dictionary, and it achieves good performance when used independently.
When evaluated on both rich-resource and low-resource languages, ProMap consistently achieves state-of-the-art results.
- Score: 22.743097175747575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bilingual Lexicon Induction (BLI), the task of translating words between two
languages, is an important NLP task. While noticeable progress on BLI in
rich-resource languages has been achieved using static word embeddings, word
translation performance can be further improved by incorporating information
from contextualized word embeddings. In this paper, we introduce ProMap, a
novel approach for BLI that leverages the power of prompting pretrained
multilingual and multidialectal language models to address these challenges. To
overcome the challenges posed by the use of subword tokens in these models,
ProMap relies on an effective padded prompting of language models with a seed
dictionary, and it achieves good performance when used independently. We also
demonstrate the effectiveness of ProMap in re-ranking results from other BLI
methods, such as those based on aligned static word embeddings. When evaluated
on both rich-resource and low-resource languages, ProMap consistently achieves
state-of-the-art results. Furthermore, ProMap enables strong performance in
few-shot scenarios (even with fewer than 10 training examples), making it a
valuable tool for low-resource language translation. Overall, we believe our
method offers an exciting and promising direction for BLI in general and for
low-resource languages in particular. ProMap code and data are available at
\url{https://github.com/4mekki4/promap}.
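To make the prompting idea concrete, the sketch below frames word translation as few-shot mask filling with a multilingual masked language model, padding the answer slot with several mask tokens so that translations spanning multiple subwords can be recovered. This is an illustrative approximation only: the model name (xlm-roberta-base), the prompt template, the greedy per-mask decoding, and the translate_word helper are assumptions made for this example, not the exact ProMap recipe.

```python
# Minimal sketch of prompting a multilingual masked LM for word translation.
# Assumptions: xlm-roberta-base as the backbone, an English->French prompt
# template, and greedy per-mask decoding. Not the authors' exact method.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "xlm-roberta-base"  # assumption: any multilingual masked LM could be used
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()


def translate_word(source_word, seed_pairs, max_subwords=4):
    """Hypothetical helper: try answer slots padded with 1..max_subwords mask
    tokens and return the highest-scoring decoded string."""
    # Few-shot demonstrations built from a small seed dictionary.
    demos = " ".join(f"The French word for '{en}' is '{fr}'." for en, fr in seed_pairs)
    best_candidate, best_score = None, float("-inf")
    for n_masks in range(1, max_subwords + 1):
        slot = tokenizer.mask_token * n_masks
        prompt = f"{demos} The French word for '{source_word}' is '{slot}'."
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        mask_positions = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
        # Simple greedy fill: take the top token at each padded mask position.
        token_ids, score = [], 0.0
        for pos in mask_positions:
            log_probs = logits[0, pos].log_softmax(dim=-1)
            top_id = int(log_probs.argmax())
            token_ids.append(top_id)
            score += float(log_probs[top_id])
        score /= n_masks  # length-normalise so longer fills compete fairly
        candidate = tokenizer.decode(token_ids).strip()
        if score > best_score:
            best_candidate, best_score = candidate, score
    return best_candidate


# Toy seed dictionary (English -> French); the abstract reports that even
# fewer than 10 pairs can be enough in few-shot settings.
seed = [("dog", "chien"), ("house", "maison"), ("water", "eau")]
print(translate_word("book", seed))
```

The same scoring function could, in principle, also be used to re-rank candidate translations produced by other BLI methods such as aligned static word embeddings, which is the re-ranking use of ProMap described in the abstract.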
Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- How Lexical is Bilingual Lexicon Induction? [1.3610643403050855]
We argue that the incorporation of additional lexical information into the recent retrieve-and-rank approach should improve lexicon induction.
We demonstrate the efficacy of our proposed approach on XLING, improving over the previous state of the art by an average of 2% across all language pairs.
arXiv Detail & Related papers (2024-04-05T17:10:33Z)
- On Bilingual Lexicon Induction with Large Language Models [81.6546357879259]
We examine the potential of the latest generation of Large Language Models for the development of bilingual lexicons.
We study 1) zero-shot prompting for unsupervised BLI and 2) few-shot in-context prompting with a set of seed translation pairs.
Our work is the first to demonstrate strong BLI capabilities of text-to-text mLLMs.
arXiv Detail & Related papers (2023-10-21T12:43:27Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual sentence encoders for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- Cross-lingual alignments of ELMo contextual embeddings [0.0]
Cross-lingual embeddings map word embeddings from a low-resource language to a high-resource language.
To produce cross-lingual mappings of recent contextual embeddings, anchor points between the embedding spaces have to be words in the same context.
We propose novel cross-lingual mapping methods for ELMo embeddings.
arXiv Detail & Related papers (2021-06-30T11:26:43Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Mining Knowledge for Natural Language Inference from Wikipedia Categories [53.26072815839198]
We introduce WikiNLI: a resource for improving model performance on NLI and LE tasks.
It contains 428,899 pairs of phrases constructed from naturally annotated category hierarchies in Wikipedia.
We show that we can improve strong baselines such as BERT and RoBERTa by pretraining them on WikiNLI and transferring the models to downstream tasks.
arXiv Detail & Related papers (2020-10-03T00:45:01Z)