Related papers: Cross-Domain Bilingual Lexicon Induction via Pretrained Language Models

URL: http://arxiv.org/abs/2505.23146v1
Date: Thu, 29 May 2025 06:37:02 GMT
Title: Cross-Domain Bilingual Lexicon Induction via Pretrained Language Models
Authors: Qiuyu Ding, Zhiqiang Cao, Hailong Cao, Tiejun Zhao
Abstract summary: We propose a new task of BLI, which is to use the monolingual corpus of the general domain and target domain to extract domain-specific bilingual dictionaries. Motivated by the ability of pre-trained models, we propose a method to get better word embeddings that builds on recent work on BLI. Experimental results show that our method improves over robust BLI baselines on three specific domains by an average of 0.78 points.
Score: 22.297388572921477
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Bilingual Lexicon Induction (BLI) generally uses common-domain data to obtain monolingual word embeddings, then aligns them to obtain cross-lingual embeddings from which word translation pairs are extracted. In this paper, we propose a new task of BLI, which is to use the monolingual corpus of the general domain and target domain to extract domain-specific bilingual dictionaries. Motivated by the ability of pre-trained models, we propose a method to get better word embeddings that builds on recent work on BLI. In doing so, we are the first to introduce Code Switch (Qin et al., 2020) into the cross-domain BLI task; it can match different strategies in different contexts, making the model more suitable for this task. Experimental results show that our method improves over robust BLI baselines on three specific domains by an average of 0.78 points.
It is yet to be seen whether these methods are suitable for bilingual lexicon extraction in professional fields. As we can see in Table 1, the classic and efficient BLI approaches, Muse and Vecmap, perform much worse on the Medical dataset than on the Wiki dataset. On one hand, specialized-domain datasets are generally smaller than generic-domain datasets, and specialized words have a lower frequency, which directly affects the translation quality of bilingual dictionaries. On the other hand, static word embeddings are widely used for BLI; however, in some specific fields the meaning of words is greatly influenced by context, and in this case using only static word embeddings may lead to greater bias.

Related papers

Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries [22.562544826766917]
Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources.
arXiv Detail & Related papers (2025-06-02T10:52:52Z)
LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries. Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z)
ProMap: Effective Bilingual Lexicon Induction via Language Model Prompting [22.743097175747575]
We introduce ProMap, a novel approach for bilingual lexicon induction (BLI). ProMap relies on an effective padded prompting of language models with a seed dictionary that achieves good performance when used independently. When evaluated on both rich-resource and low-resource languages, ProMap consistently achieves state-of-the-art results.
arXiv Detail & Related papers (2023-10-28T18:33:24Z)
On Bilingual Lexicon Induction with Large Language Models [81.6546357879259]
We examine the potential of the latest generation of Large Language Models for the development of bilingual lexicons. We study 1) zero-shot prompting for unsupervised BLI and 2) few-shot in-context prompting with a set of seed translation pairs. Our work is the first to demonstrate strong BLI capabilities of text-to-text mLLMs.
arXiv Detail & Related papers (2023-10-21T12:43:27Z)
Improving Bilingual Lexicon Induction with Cross-Encoder Reranking [31.142790337451366]
We propose a novel semi-supervised post-hoc reranking method termed BLICEr (BLI with Cross-Encoder Reranking). The key idea is to 'extract' cross-lingual lexical knowledge from mPLMs, and then combine it with the original CLWEs. BLICEr establishes new results on two standard BLI benchmarks spanning a wide spectrum of diverse languages.
arXiv Detail & Related papers (2022-10-30T21:26:07Z)
Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution [124.99894592871385]
We present a large-scale comparative study of lexical substitution methods employing both old and most recent language models. We show that already competitive results achieved by SOTA LMs/MLMs can be further substantially improved if information about the target word is injected properly.
arXiv Detail & Related papers (2022-06-07T16:16:19Z)
Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs. We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models. We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
Combining Static Word Embeddings and Contextual Representations for Bilingual Lexicon Induction [19.375597786174197]
We propose a simple yet effective mechanism to combine the static word embeddings and the contextual representations. We test the combination mechanism on various language pairs under the supervised and unsupervised BLI benchmark settings.
arXiv Detail & Related papers (2021-06-06T10:31:02Z)
Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking [66.76141128555099]
We propose a novel cross-lingual biomedical entity linking task (XL-BEL). We first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task. We then address the challenge of transferring domain-specific knowledge in resource-rich languages to resource-poor ones.
arXiv Detail & Related papers (2021-05-30T00:50:00Z)
Beyond Offline Mapping: Learning Cross Lingual Word Embeddings through Context Anchoring [41.77270308094212]
We propose an alternative mapping approach for word embeddings in languages other than English. Rather than aligning two fixed embedding spaces, our method works by fixing the target language embeddings, and learning a new set of embeddings for the source language that are aligned with them. Our approach outperforms conventional mapping methods on bilingual lexicon induction, and obtains competitive results in the downstream XNLI task.
arXiv Detail & Related papers (2020-12-31T17:10:14Z)
XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word. We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages. Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics. Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings. We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.