Prix-LM: Pretraining for Multilingual Knowledge Base Construction
- URL: http://arxiv.org/abs/2110.08443v1
- Date: Sat, 16 Oct 2021 02:08:46 GMT
- Title: Prix-LM: Pretraining for Multilingual Knowledge Base Construction
- Authors: Wenxuan Zhou, Fangyu Liu, Ivan Vulić, Nigel Collier, Muhao Chen
- Abstract summary: We propose a unified framework, Prix-LM, for multilingual knowledge base (KB) construction and completion.
We leverage two types of knowledge, monolingual triples and cross-lingual links, extracted from existing multilingual KBs.
Experiments on standard entity-related tasks, such as link prediction in multiple languages, cross-lingual entity linking and bilingual lexicon induction, demonstrate its effectiveness.
- Score: 59.02868906044296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge bases (KBs) contain plenty of structured world and commonsense
knowledge. As such, they often complement distributional text-based information
and facilitate various downstream tasks. Since their manual construction is
resource- and time-intensive, recent efforts have tried leveraging large
pretrained language models (PLMs) to generate additional monolingual knowledge
facts for KBs. However, such methods have not been attempted for building and
enriching multilingual KBs. Besides wider application, such multilingual KBs
can provide richer combined knowledge than monolingual (e.g., English) KBs.
Knowledge expressed in different languages may be complementary and unequally
distributed: this implies that the knowledge available in high-resource
languages can be transferred to low-resource ones. To achieve this, it is
crucial to represent multilingual knowledge in a shared/unified space. To this
end, we propose a unified framework, Prix-LM, for multilingual KB construction
and completion. We leverage two types of knowledge, monolingual triples and
cross-lingual links, extracted from existing multilingual KBs, and tune a
multilingual language encoder XLM-R via a causal language modeling objective.
Prix-LM integrates useful multilingual and KB-based factual knowledge into a
single model. Experiments on standard entity-related tasks, such as link
prediction in multiple languages, cross-lingual entity linking and bilingual
lexicon induction, demonstrate its effectiveness, with gains reported over
strong task-specialised baselines.
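As a rough illustration of the core idea, the sketch below (an assumed setup, not the authors' released code) linearises a single KB triple into a text sequence and computes a causal language modeling loss with an XLM-R-based decoder from Hugging Face Transformers. The " | " separator, the entity strings, and the checkpoint choice are assumptions for illustration; cross-lingual links would presumably be linearised in a similar way.

```python
# Minimal sketch (assumed setup, not the Prix-LM release): turn one KB triple
# into text and compute a causal LM loss on top of XLM-R.
from transformers import (
    XLMRobertaConfig,
    XLMRobertaForCausalLM,
    XLMRobertaTokenizer,
)

config = XLMRobertaConfig.from_pretrained("xlm-roberta-base")
config.is_decoder = True  # causal (left-to-right) attention instead of bidirectional
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
model = XLMRobertaForCausalLM.from_pretrained("xlm-roberta-base", config=config)

# A monolingual triple (subject, relation, object) linearised as plain text.
# The separator and entity strings are made up; the paper's exact input
# format may differ.
triple_text = "Paris | capital of | France"
inputs = tokenizer(triple_text, return_tensors="pt")

# Causal LM objective: each token is predicted from its left context.
# After training, the object ("France") could be generated token by token
# from the "Paris | capital of |" prefix, which is how link prediction
# can be framed as generation.
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()
print(f"causal LM loss: {outputs.loss.item():.4f}")
```

In practice the loss would be accumulated over batches of linearised triples and cross-lingual links drawn from the multilingual KB, with a standard optimizer step after each batch.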
Related papers
- BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment [42.193395498828764]
We introduce BayLing 2, which efficiently transfers generative capabilities and knowledge from high-resource languages to low-resource languages.
For multilingual translation across 100+ languages, BayLing shows superior performance compared to open-source models of similar scale.
Demo, homepage, code and models of BayLing are available.
arXiv Detail & Related papers (2024-11-25T11:35:08Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Cross-Lingual Question Answering over Knowledge Base as Reading Comprehension [61.079852289005025]
Cross-lingual question answering over knowledge base (xKBQA) aims to answer questions in languages different from that of the provided knowledge base.
One of the major challenges facing xKBQA is the high cost of data annotation.
We propose a novel approach for xKBQA in a reading comprehension paradigm.
arXiv Detail & Related papers (2023-02-26T05:52:52Z)
- Adapters for Enhanced Modeling of Multilingual Knowledge and Text [54.02078328453149]
Language models have been extended to multilingual language models (MLLMs).
Knowledge graphs contain facts in an explicit triple format, which require careful curation and are only available in a few high-resource languages.
We propose to enhance MLLMs with knowledge from multilingual knowledge graphs (MLKGs) so as to tackle language and knowledge graph tasks across many languages.
arXiv Detail & Related papers (2022-10-24T21:33:42Z)
- Knowledge Based Multilingual Language Model [44.70205282863062]
We present a novel framework to pretrain knowledge-based multilingual language models (KMLMs).
We generate a large amount of code-switched synthetic sentences and reasoning-based multilingual training data using the Wikidata knowledge graphs (see the toy sketch after this list).
Based on the intra- and inter-sentence structures of the generated data, we design pretraining tasks to facilitate knowledge learning.
arXiv Detail & Related papers (2021-11-22T02:56:04Z)
- A Multilingual Modeling Method for Span-Extraction Reading Comprehension [2.4905424368103444]
We propose a multilingual extractive reading comprehension approach called XLRC.
We show that our model outperforms the state-of-the-art baseline (i.e., RoBERTa_Large) on the CMRC 2018 task.
arXiv Detail & Related papers (2021-05-31T11:05:30Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models into a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
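The code-switched data generation mentioned in the Knowledge Based Multilingual Language Model entry can be illustrated with a toy example. The sketch below is purely hypothetical: the mini entity-label table, the template sentence, and the helper function are made up and stand in for Wikidata and the authors' actual pipeline; it simply swaps entity mentions in a sentence for their labels in another language.

```python
# Toy sketch (hypothetical, not the KMLM authors' pipeline): build code-switched
# sentences by replacing entity mentions with their labels in other languages.
# The mini label table stands in for Wikidata; Q90/Q142 are the real Wikidata
# IDs for Paris/France, but the template and helper are made up for illustration.
ENTITY_LABELS = {
    "Q90": {"en": "Paris", "de": "Paris", "zh": "巴黎"},         # Paris
    "Q142": {"en": "France", "de": "Frankreich", "zh": "法国"},  # France
}

def code_switch(template, slots, target_lang):
    """Fill each slot with the target-language label of its linked entity."""
    for slot, qid in slots.items():
        template = template.replace("{" + slot + "}", ENTITY_LABELS[qid][target_lang])
    return template

template = "{subj} is the capital of {obj}."
slots = {"subj": "Q90", "obj": "Q142"}

for lang in ("en", "de", "zh"):
    print(code_switch(template, slots, lang))
# en: Paris is the capital of France.
# de: Paris is the capital of Frankreich.   <- German labels switched into English text
# zh: 巴黎 is the capital of 法国.
```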