LexGen: Domain-aware Multilingual Lexicon Generation
- URL: http://arxiv.org/abs/2405.11200v2
- Date: Tue, 24 Sep 2024 06:31:12 GMT
- Title: LexGen: Domain-aware Multilingual Lexicon Generation
- Authors: Ayush Maheshwari, Atul Kumar Singh, Karthika NJ, Krishnakant Bhatt, Preethi Jyothi, Ganesh Ramakrishnan
- Abstract summary: We propose a new model to generate dictionary words for 6 Indian languages in the multi-domain setting.
Our model consists of domain-specific and domain-generic layers that encode information, invoked via a learnable routing technique.
We release a new benchmark dataset across 6 Indian languages that span 8 diverse domains.
- Score: 40.97738267067852
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lexicon or dictionary generation across domains is of significant societal importance, as it can enhance information accessibility for a diverse user base while preserving language identity. Prior work in the field primarily focuses on bilingual lexical induction, which handles word alignment using mapping-based or corpus-based approaches. Research on lexicon generation itself remains limited, and even more so for domain-specific lexicons. The task is particularly important in medical, engineering, and other technical domains, where terms occur infrequently and data for technical terms is scarce in many low-resource languages. To address this gap, we propose a new model to generate dictionary words for 6 Indian languages in a multi-domain setting. Our model consists of domain-specific and domain-generic layers that encode information, and these layers are invoked via a learnable routing technique. Further, we propose an approach that explicitly leverages the relatedness between these Indian languages to produce coherent translations. We also release a new benchmark dataset covering 6 Indian languages and 8 diverse domains to propel further research in domain-specific lexicon induction. We conduct both zero-shot and few-shot experiments across multiple domains to show the efficacy of our proposed model in generalizing to unseen domains and unseen languages.
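The core architectural idea in the abstract, domain-specific and domain-generic layers selected by a learnable router, can be illustrated with a minimal sketch. The module below is an assumption-based illustration, not the authors' implementation: the use of linear layers, the softmax router, and all names (RoutedDomainEncoder, hidden_dim, num_domains) are hypothetical.

```python
# Minimal sketch (not the authors' code) of routing between a shared,
# domain-generic layer and per-domain layers, as described in the abstract.
import torch
import torch.nn as nn


class RoutedDomainEncoder(nn.Module):
    def __init__(self, hidden_dim: int, num_domains: int):
        super().__init__()
        self.generic = nn.Linear(hidden_dim, hidden_dim)      # domain-generic layer
        self.specific = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_domains)]
        )                                                      # one layer per domain
        self.router = nn.Linear(hidden_dim, num_domains + 1)  # learnable routing weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, hidden_dim) word/token representation
        candidates = [self.generic(x)] + [layer(x) for layer in self.specific]
        stacked = torch.stack(candidates, dim=1)               # (batch, D+1, hidden_dim)
        weights = torch.softmax(self.router(x), dim=-1)        # (batch, D+1)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)    # routed mixture


if __name__ == "__main__":
    enc = RoutedDomainEncoder(hidden_dim=256, num_domains=8)
    out = enc(torch.randn(4, 256))
    print(out.shape)  # torch.Size([4, 256])
```

The softmax router here acts as a soft mixture over the shared and per-domain layers; a hard or top-k routing variant would be an equally plausible reading of "learnable routing technique".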
Related papers
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey [100.24095818099522]
Large language models (LLMs) have significantly advanced the field of natural language processing (NLP).
They provide a highly useful, task-agnostic foundation for a wide range of applications.
However, directly applying LLMs to solve sophisticated problems in specific domains meets many hurdles.
arXiv Detail & Related papers (2023-05-30T03:00:30Z)
- Can Domains Be Transferred Across Languages in Multi-Domain Multilingual Neural Machine Translation? [52.27798071809941]
This paper investigates whether domain information can be transferred across languages when composing multi-domain and multilingual NMT.
We find that multi-domain multilingual (MDML) NMT can boost zero-shot translation performance by up to +10 BLEU.
arXiv Detail & Related papers (2022-10-20T23:13:54Z)
- Using Linguistic Typology to Enrich Multilingual Lexicons: the Case of Lexical Gaps in Kinship [4.970603969125883]
We capture the phenomenon of diversity through the notions of lexical gap and language-specific word.
We publish a lexico-semantic resource consisting of 198 domain concepts, 1,911 words, and 37,370 gaps covering 699 languages.
arXiv Detail & Related papers (2022-04-11T12:36:26Z)
- MDAPT: Multilingual Domain Adaptive Pretraining in a Single Model [17.566140528671134]
We show that a single multilingual domain-specific model can outperform the general multilingual model.
We propose different techniques to compose pretraining corpora that enable a language model to both become domain-specific and multilingual.
arXiv Detail & Related papers (2021-09-14T11:50:26Z)
- Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking [66.76141128555099]
We propose a novel cross-lingual biomedical entity linking task (XL-BEL).
We first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task.
We then address the challenge of transferring domain-specific knowledge in resource-rich languages to resource-poor ones.
arXiv Detail & Related papers (2021-05-30T00:50:00Z)
- Unsupervised Domain Clusters in Pretrained Language Models [61.832234606157286]
We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision.
We propose domain data selection methods based on such models.
We evaluate our data selection methods for neural machine translation across five diverse domains.
arXiv Detail & Related papers (2020-04-05T06:22:16Z)