Improving Bilingual Lexicon Induction with Cross-Encoder Reranking
- URL: http://arxiv.org/abs/2210.16953v2
- Date: Thu, 17 Oct 2024 22:47:50 GMT
- Title: Improving Bilingual Lexicon Induction with Cross-Encoder Reranking
- Authors: Yaoyiran Li, Fangyu Liu, Ivan Vulić, Anna Korhonen
- Abstract summary: We propose a novel semi-supervised post-hoc reranking method termed BLICEr (BLI with Cross-Encoder Reranking).
The key idea is to 'extract' cross-lingual lexical knowledge from mPLMs, and then combine it with the original CLWEs.
BLICEr establishes new state-of-the-art results on two standard BLI benchmarks spanning a wide spectrum of diverse languages.
- Score: 31.142790337451366
- Abstract: Bilingual lexicon induction (BLI) with limited bilingual supervision is a crucial yet challenging task in multilingual NLP. Current state-of-the-art BLI methods rely on the induction of cross-lingual word embeddings (CLWEs) to capture cross-lingual word similarities; such CLWEs are obtained 1) via traditional static models (e.g., VecMap), or 2) by extracting type-level CLWEs from multilingual pretrained language models (mPLMs), or 3) through combining the former two options. In this work, we propose a novel semi-supervised post-hoc reranking method termed BLICEr (BLI with Cross-Encoder Reranking), applicable to any precalculated CLWE space, which improves their BLI capability. The key idea is to 'extract' cross-lingual lexical knowledge from mPLMs, and then combine it with the original CLWEs. This crucial step is done via 1) creating a word similarity dataset, comprising positive word pairs (i.e., true translations) and hard negative pairs induced from the original CLWE space, and then 2) fine-tuning an mPLM (e.g., mBERT or XLM-R) in a cross-encoder manner to predict the similarity scores. At inference, we 3) combine the similarity score from the original CLWE space with the score from the BLI-tuned cross-encoder. BLICEr establishes new state-of-the-art results on two standard BLI benchmarks spanning a wide spectrum of diverse languages: it substantially outperforms a series of strong baselines across the board. We also validate the robustness of BLICEr with different CLWEs.
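Below is a minimal sketch of the three steps described in the abstract, not the authors' released implementation: it assumes precomputed CLWE matrices (`src_emb`, `tgt_emb`), vocabulary lists (`src_words`, `tgt_words`), and a small seed dictionary of translation pairs, and it uses the `sentence-transformers` CrossEncoder API as a stand-in for the BLI-tuned mPLM. The negative count `k_neg` and the mixing weight `lam` are hypothetical hyperparameters, not values from the paper.
```python
# Illustrative sketch of a BLICEr-style pipeline (assumptions noted above).
import numpy as np
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

def cosine_sim(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# 1) Build positive pairs (true translations) and hard negatives: for each source
#    word, the highest-ranked *wrong* targets in the CLWE space are hard negatives.
def build_pairs(src_words, tgt_words, src_emb, tgt_emb, seed_dict, k_neg=5):
    sims = cosine_sim(src_emb, tgt_emb)
    examples = []
    for s, t in seed_dict:
        examples.append(InputExample(texts=[s, t], label=1.0))
        ranked = np.argsort(-sims[src_words.index(s)])
        negatives = [tgt_words[j] for j in ranked if tgt_words[j] != t][:k_neg]
        for n in negatives:
            examples.append(InputExample(texts=[s, n], label=0.0))
    return examples

# 2) Fine-tune an mPLM (e.g., XLM-R) in a cross-encoder manner to predict pair scores.
def train_cross_encoder(examples, model_name="xlm-roberta-base"):
    ce = CrossEncoder(model_name, num_labels=1)
    loader = DataLoader(examples, shuffle=True, batch_size=32)
    ce.fit(train_dataloader=loader, epochs=3)
    return ce

# 3) At inference, interpolate the original CLWE similarity with the cross-encoder
#    score and rerank the candidate translations.
def rerank(src_word, candidates, clwe_scores, ce, lam=0.5):
    ce_scores = ce.predict([(src_word, c) for c in candidates])
    combined = lam * np.asarray(clwe_scores) + (1 - lam) * np.asarray(ce_scores)
    return [candidates[i] for i in np.argsort(-combined)]
```
The exact input template fed to the cross-encoder and the exact score-combination scheme in BLICEr may differ; the sketch only illustrates the overall flow of mining hard negatives, fine-tuning, and reranking.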
Related papers
- Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning [58.92843729869586]
Vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, but their mastery of only a few languages, such as English, restricts their applicability to broader communities.
We propose to extend VL-PTMs' language capacity by continual language learning (CLL), where a model needs to update its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF).
We construct a CLL benchmark covering 36 languages based on MSCOCO and XM3600 datasets and then evaluate multilingual image-text retrieval performance.
arXiv Detail & Related papers (2024-01-30T17:14:05Z)
- On Bilingual Lexicon Induction with Large Language Models [81.6546357879259]
We examine the potential of the latest generation of Large Language Models for the development of bilingual lexicons.
We study 1) zero-shot prompting for unsupervised BLI and 2) few-shot in-context prompting with a set of seed translation pairs (a minimal prompting sketch is given after the related papers below).
Our work is the first to demonstrate strong BLI capabilities of text-to-text mLLMs.
arXiv Detail & Related papers (2023-10-21T12:43:27Z)
- VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model, VECO 2.0, based on contrastive learning with multi-granularity alignments.
Specifically, sequence-to-sequence alignment is induced to maximize the similarity of parallel pairs and minimize the similarity of non-parallel pairs.
Token-to-token alignment is integrated to bridge the gap between synonymous tokens, mined via a thesaurus dictionary, and the other unpaired tokens in a bilingual instance (an illustrative contrastive-alignment sketch is given after the related papers below).
arXiv Detail & Related papers (2023-04-17T12:23:41Z)
- Multilingual Sentence Transformer as A Multilingual Word Aligner [15.689680887384847]
We investigate whether the multilingual sentence Transformer LaBSE is a strong multilingual word aligner.
Experimental results on seven language pairs show that our best aligner outperforms previous state-of-the-art models of all varieties.
Our aligner supports different language pairs in a single model, and even achieves new state-of-the-art results on zero-shot language pairs that do not appear in the fine-tuning process.
arXiv Detail & Related papers (2023-01-28T09:28:55Z)
- A Multi-level Supervised Contrastive Learning Framework for Low-Resource Natural Language Inference [54.678516076366506]
Natural Language Inference (NLI) is an increasingly essential task in natural language understanding.
Here we propose a multi-level supervised contrastive learning framework named MultiSCL for low-resource natural language inference.
arXiv Detail & Related papers (2022-05-31T05:54:18Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual sentence encoders for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- Improving Word Translation via Two-Stage Contrastive Learning [46.71404992627519]
We propose a robust and effective two-stage contrastive learning framework for the BLI task.
Comprehensive experiments on standard BLI datasets for diverse languages show substantial gains achieved by our framework.
arXiv Detail & Related papers (2022-03-15T22:51:22Z)
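As flagged above, here is a minimal sketch of the zero-shot and few-shot prompting settings studied in "On Bilingual Lexicon Induction with Large Language Models"; the prompt templates, the example language pair, and the `flan-t5-small` stand-in model are illustrative assumptions rather than the paper's actual setup.
```python
# Illustrative prompting sketch for LLM-based BLI (templates and model are assumptions).
from transformers import pipeline

def zero_shot_prompt(src_word, src_lang="German", tgt_lang="English"):
    # Unsupervised setting: no seed translation pairs are used.
    return f"Translate the {src_lang} word '{src_word}' into {tgt_lang}:"

def few_shot_prompt(src_word, seed_pairs, src_lang="German", tgt_lang="English"):
    # Few-shot in-context setting: a handful of seed translation pairs serve as demonstrations.
    demos = "\n".join(f"{s} -> {t}" for s, t in seed_pairs)
    return f"Translate {src_lang} words into {tgt_lang}.\n{demos}\n{src_word} ->"

# Any text-to-text multilingual LLM can consume these prompts; a small seq2seq
# model is used here purely as a runnable stand-in.
generator = pipeline("text2text-generation", model="google/flan-t5-small")
print(generator(few_shot_prompt("Hund", [("Katze", "cat"), ("Haus", "house")]))[0]["generated_text"])
```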
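Also as flagged above, the sequence-to-sequence alignment in the VECO 2.0 entry can be illustrated with a generic in-batch contrastive (InfoNCE-style) objective; this is a sketch of the general technique, not the paper's exact loss: parallel sentence pairs in a batch act as positives, and every other in-batch combination acts as a negative.
```python
# Generic in-batch contrastive alignment loss; a sketch, not VECO 2.0's exact objective.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(src_repr, tgt_repr, temperature=0.05):
    """src_repr, tgt_repr: (batch, dim) representations of parallel sentence pairs."""
    src = F.normalize(src_repr, dim=-1)
    tgt = F.normalize(tgt_repr, dim=-1)
    logits = src @ tgt.T / temperature                       # pairwise similarities
    labels = torch.arange(src.size(0), device=src.device)    # i-th source pairs with i-th target
    # Symmetric cross-entropy: each source must retrieve its parallel target and vice versa,
    # which maximizes similarity of parallel pairs and minimizes it for non-parallel ones.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```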
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.