Multilingual End to End Entity Linking
- URL: http://arxiv.org/abs/2306.08896v1
- Date: Thu, 15 Jun 2023 07:02:00 GMT
- Title: Multilingual End to End Entity Linking
- Authors: Mikhail Plekhanov, Nora Kassner, Kashyap Popat, Louis Martin, Simone
Merello, Borislav Kozlovskii, Frédéric A. Dreyer, Nicola Cancedda
- Abstract summary: We release and open source BELA, the first fully end-to-end multilingual entity linking model.
BELA efficiently detects and links entities in texts in any of 97 languages.
- Score: 6.478594417785639
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Entity Linking is one of the most common Natural Language Processing tasks in
practical applications, but so far efficient end-to-end solutions with
multilingual coverage have been lacking, leading to complex model stacks. To
fill this gap, we release and open source BELA, the first fully end-to-end
multilingual entity linking model that efficiently detects and links entities
in texts in any of 97 languages. We provide here a detailed description of the
model and report BELA's performance on four entity linking datasets covering
high- and low-resource languages.
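Since the abstract describes one model handling both mention detection and disambiguation, a rough sketch of such an end-to-end forward pass is given below. This is an illustrative toy, not BELA's actual code (which is in the open-source release); the tensors, dimensions, and top-k cutoff are random stand-ins.

```python
# Minimal sketch of an end-to-end entity linking forward pass: one
# encoder pass yields token representations, a span head proposes
# mention candidates, and a dot product against precomputed entity
# embeddings links each mention. All values are illustrative stand-ins.
import torch

hidden, n_tokens, n_entities = 64, 12, 1000
token_reps = torch.randn(n_tokens, hidden)       # from a multilingual encoder
entity_index = torch.randn(n_entities, hidden)   # precomputed entity embeddings

# Mention detection: score every (start, end) span with start <= end.
start_scores = token_reps @ torch.randn(hidden)
end_scores = token_reps @ torch.randn(hidden)
span_scores = start_scores[:, None] + end_scores[None, :]
valid = torch.triu(torch.ones(n_tokens, n_tokens, dtype=torch.bool))
flat = span_scores.masked_fill(~valid, float("-inf")).flatten()

# Keep the top-k spans and link each to its nearest entity embedding.
for idx in torch.topk(flat, k=3).indices:
    s, e = divmod(int(idx), n_tokens)
    span_rep = token_reps[s : e + 1].mean(dim=0)
    entity_id = int((entity_index @ span_rep).argmax())
    print(f"mention span ({s}, {e}) -> entity {entity_id}")
```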
Related papers
- Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models [52.22235443948351]
High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale. JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings.
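As a rough illustration of the distillation idea in this summary, the sketch below fits a lightweight ridge regressor on multilingual embeddings to mimic LLM quality scores, then filters a corpus with it. The data, dimensions, 0-5 score scale, and retention quantile are assumptions, not details from the paper.

```python
# Minimal sketch: distill LLM quality annotations into a cheap regressor
# over multilingual sentence embeddings, then filter a corpus with it.
# Embeddings and scores are random stand-ins for real ones.
import numpy as np

rng = np.random.default_rng(0)
dim = 384                                  # assumed sentence-embedding size
X_train = rng.normal(size=(500, dim))      # embeddings of LLM-annotated docs
y_train = rng.uniform(0, 5, size=500)      # LLM quality scores (assumed scale)

# Closed-form ridge regression as the "lightweight annotator".
lam = 1.0
W = np.linalg.solve(X_train.T @ X_train + lam * np.eye(dim), X_train.T @ y_train)

# Score a large unlabeled corpus cheaply and keep the top fraction.
X_corpus = rng.normal(size=(10_000, dim))
scores = X_corpus @ W
keep = scores >= np.quantile(scores, 0.7)  # retain top 30% by predicted quality
print(f"kept {keep.sum()} of {len(keep)} documents")
```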
arXiv Detail & Related papers (2025-05-28T11:06:54Z) - LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models [89.13128402847943]
We present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision.
LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks.
We introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages.
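The summary leaves the coupling between the two components open; one plausible minimal reading is a learned connector that aligns the multilingual encoder's output space with the LLM's embedding space using English-only data. The sketch below implements that assumption, with random tensors standing in for both models.

```python
# Minimal sketch (an assumption, not LUSIFER's confirmed design): train a
# linear connector to map multilingual-encoder outputs into the LLM
# embedding space on English data, then reuse it zero-shot for any language.
import torch
import torch.nn as nn

enc_dim, llm_dim, batch = 768, 4096, 8
connector = nn.Linear(enc_dim, llm_dim)
opt = torch.optim.Adam(connector.parameters(), lr=1e-3)

# Stand-ins for pooled multilingual-encoder outputs and the LLM's target
# embeddings of the same English texts; real models would produce these.
enc_out = torch.randn(batch, enc_dim)
llm_target = torch.randn(batch, llm_dim)

for _ in range(100):                         # align the two spaces
    loss = nn.functional.mse_loss(connector(enc_out), llm_target)
    opt.zero_grad(); loss.backward(); opt.step()

# At inference, any language takes the same path, zero-shot.
query = torch.randn(1, enc_dim)              # e.g., a Swahili sentence's encoding
print(connector(query).shape)                # torch.Size([1, 4096])
```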
arXiv Detail & Related papers (2025-01-01T15:43:07Z) - mCoT: Multilingual Instruction Tuning for Reasoning Consistency in Language Models [21.616940026409818]
Large language models (LLMs) with Chain-of-thought (CoT) have recently emerged as a powerful technique for eliciting reasoning to improve downstream tasks.
We study multilingual reasoning consistency across multiple languages, using popular open-source LLMs.
We introduce multilingual CoT instruction tuning to boost reasoning capability across languages, thereby improving model consistency.
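A minimal sketch of what a multilingual CoT instruction-tuning instance could look like is shown below; the template and field names are assumptions, not the paper's actual format.

```python
# Minimal sketch: the same problem paired with a chain-of-thought
# rationale in several languages, giving parallel reasoning supervision.
examples = [
    {"lang": "en", "question": "What is 12 * 4?",
     "cot": "12 * 4 = 48.", "answer": "48"},
    {"lang": "de", "question": "Was ist 12 * 4?",
     "cot": "12 * 4 = 48.", "answer": "48"},
]

def to_instruction(ex: dict) -> str:
    # Hypothetical prompt template for fine-tuning.
    return (f"[{ex['lang']}] Question: {ex['question']}\n"
            f"Let's think step by step. {ex['cot']}\n"
            f"Answer: {ex['answer']}")

for ex in examples:
    print(to_instruction(ex), end="\n\n")
```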
arXiv Detail & Related papers (2024-06-04T13:30:45Z) - UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised
Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z) - Zero-shot Sentiment Analysis in Low-Resource Languages Using a
Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
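As an illustration of supervision from a lexicon rather than sentence-level labels, the toy sketch below weakly labels sentences by counting lexicon hits; a real pipeline would pretrain on such weak labels at scale, and the tiny lexicon here is purely illustrative.

```python
# Minimal sketch of lexicon-based weak labeling across languages.
# The lexicon is a toy stand-in for a real multilingual sentiment lexicon.
lexicon = {
    "good": 1, "bueno": 1, "gut": 1,
    "bad": -1, "malo": -1, "schlecht": -1,
}

def weak_label(sentence: str) -> int:
    # Sum word-level polarities; sign gives the weak sentence label.
    score = sum(lexicon.get(tok.lower().strip(".,!?"), 0)
                for tok in sentence.split())
    return 1 if score > 0 else -1 if score < 0 else 0

print(weak_label("El libro es bueno"))         # 1
print(weak_label("Das Wetter ist schlecht."))  # -1
```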
arXiv Detail & Related papers (2024-02-03T10:41:05Z) - Multilingual Word Embeddings for Low-Resource Languages using Anchors
and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time by starting from the resource-rich source and sequentially adding each language in the chain until we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M tokens) target languages.
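One plausible concrete reading of this chain procedure, sketched below with random vectors, maps each language into the shared space via an orthogonal Procrustes solution over anchor words; the alignment method and the example chain are assumptions, not details confirmed by the summary.

```python
# Minimal sketch of chain-based alignment: each language is mapped into
# the growing shared space using anchors it shares with the previous
# link. All embeddings are random stand-ins for real ones.
import numpy as np

rng = np.random.default_rng(0)
dim, n_anchors = 50, 200
chain = ["hi", "mr", "bho"]   # hypothetical related-language chain

shared_anchor = rng.normal(size=(n_anchors, dim))  # anchors in the shared space

def procrustes(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Orthogonal W minimizing ||src @ W - tgt||_F."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

for lang in chain:
    lang_anchor = rng.normal(size=(n_anchors, dim))  # same anchors, this language
    W = procrustes(lang_anchor, shared_anchor)
    mapped = lang_anchor @ W
    print(lang, "residual:", round(float(np.linalg.norm(mapped - shared_anchor)), 2))
    shared_anchor = mapped   # mapped anchors seed the next link in the chain
```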
arXiv Detail & Related papers (2023-11-21T09:59:29Z) - PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage of pre-training (a schedule sketched below).
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
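The curriculum can be pictured as a schedule over the non-English sampling probability. The sketch below uses a linear ramp from 30% to 60%, which is an assumption; the summary only gives the start and end proportions of the two stages.

```python
# Minimal sketch of a data-mixture curriculum: the probability of
# sampling a non-English document rises over pre-training.
# The linear interpolation between stages is an assumption.
def non_english_fraction(step: int, total_steps: int,
                         start: float = 0.30, end: float = 0.60) -> float:
    progress = min(step / total_steps, 1.0)
    return start + (end - start) * progress

total = 100_000
for step in (0, 50_000, 100_000):
    frac = non_english_fraction(step, total)
    print(f"step {step:>7}: sample non-English with p={frac:.2f}")
```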
arXiv Detail & Related papers (2023-07-12T09:00:37Z) - Multilingual Entity and Relation Extraction from Unified to
Language-specific Training [29.778332361215636]
Existing approaches for entity and relation extraction tasks mainly focus on the English corpora and ignore other languages.
We propose a two-stage multilingual training method and a joint model called Multilingual Entity and Relation Extraction framework (mERE) to mitigate language interference.
Our method outperforms both the monolingual and multilingual baseline methods.
arXiv Detail & Related papers (2023-01-11T12:26:53Z) - A Multilingual Bag-of-Entities Model for Zero-Shot Cross-Lingual Text
Classification [16.684856745734944]
We present a multilingual bag-of-entities model that boosts the performance of zero-shot cross-lingual text classification.
It leverages the multilingual nature of Wikidata: entities in multiple languages representing the same concept are defined with a unique identifier.
A model trained on entity features in a resource-rich language can thus be directly applied to other languages.
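The language-independence of Wikidata IDs is what makes this transfer possible: mentions in different languages map to identical feature vectors, as the toy sketch below shows. The tiny mention table is illustrative; a real system would rely on a full entity linker.

```python
# Minimal sketch of bag-of-entities features: surface forms in any
# language resolve to the same Wikidata IDs (Q1490 = Tokyo, Q17 = Japan),
# so feature vectors coincide across languages.
mention_to_qid = {
    "tokyo": "Q1490", "tokio": "Q1490",
    "japan": "Q17", "japón": "Q17",
}

def bag_of_entities(text: str) -> dict:
    bag = {}
    for tok in text.lower().split():
        qid = mention_to_qid.get(tok.strip(".,"))
        if qid:
            bag[qid] = bag.get(qid, 0) + 1
    return bag

print(bag_of_entities("Tokyo is the capital of Japan."))  # {'Q1490': 1, 'Q17': 1}
print(bag_of_entities("Tokio es la capital de Japón."))   # identical features
```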
arXiv Detail & Related papers (2021-10-15T01:10:50Z) - Are Multilingual Models the Best Choice for Moderately Under-resourced
Languages? A Comprehensive Assessment for Catalan [0.05277024349608833]
This work focuses on Catalan with the aim of exploring to what extent a medium-sized monolingual language model is competitive with state-of-the-art large multilingual models.
We build a clean, high-quality textual Catalan corpus (CaText), train a Transformer-based language model for Catalan (BERTa), and devise a thorough evaluation in a diversity of settings.
The result is a new benchmark, the Catalan Language Understanding Benchmark (CLUB), which we publish as an open resource.
arXiv Detail & Related papers (2021-07-16T13:52:01Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
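One data-efficient strategy consistent with this summary, sketched below under that assumption, is to learn a fresh embedding matrix for the new script's vocabulary while the pretrained encoder body stays frozen; the paper's exact methods may differ.

```python
# Minimal sketch (an assumed strategy, not the paper's confirmed method):
# freeze a pretrained encoder body and train only a new embedding matrix
# for the unseen script's vocabulary. The encoder is a random stand-in.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
body = nn.TransformerEncoder(layer, num_layers=2)
for p in body.parameters():
    p.requires_grad = False                  # keep the pretrained body fixed

new_vocab, d_model = 5000, 128
new_embeddings = nn.Embedding(new_vocab, d_model)  # only these weights train

tokens = torch.randint(0, new_vocab, (2, 16))      # a batch in the new script
hidden = body(new_embeddings(tokens))
print(hidden.shape)                                # torch.Size([2, 16, 128])
```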
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Towards Fully Bilingual Deep Language Modeling [1.3455090151301572]
We consider whether it is possible to pre-train a bilingual model for two remotely related languages without compromising performance in either language.
We create a Finnish-English bilingual BERT model and evaluate its performance on datasets used to evaluate the corresponding monolingual models.
Our bilingual model performs on par with Google's original English BERT on GLUE and nearly matches the performance of monolingual Finnish BERT on a range of Finnish NLP tasks.
arXiv Detail & Related papers (2020-10-22T12:22:50Z) - CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot
Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with existing work, our method does not rely on bilingual sentences for training and requires only one training process for multiple target languages.
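The sketch below illustrates the kind of dictionary-based code-switching augmentation the summary describes: words are randomly replaced with translations drawn from multiple bilingual lexicons in a single pass. The tiny dictionaries and replacement rate are stand-ins.

```python
# Minimal sketch of code-switching data augmentation: each word is
# replaced, with probability p, by its translation in a randomly chosen
# target language. The dictionaries are toy stand-ins for real lexicons.
import random

dictionaries = {
    "de": {"good": "gut", "movie": "Film", "very": "sehr"},
    "es": {"good": "buena", "movie": "película", "very": "muy"},
}

def code_switch(sentence: str, p: float = 0.5, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = []
    for tok in sentence.split():
        if rng.random() < p:
            lang = rng.choice(list(dictionaries))
            out.append(dictionaries[lang].get(tok.lower(), tok))
        else:
            out.append(tok)
    return " ".join(out)

print(code_switch("a very good movie"))   # e.g., "a sehr good película"
```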
arXiv Detail & Related papers (2020-06-11T13:15:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.