RED$^{\rm FM}$: a Filtered and Multilingual Relation Extraction Dataset
- URL: http://arxiv.org/abs/2306.09802v2
- Date: Mon, 19 Jun 2023 09:25:27 GMT
- Title: RED$^{\rm FM}$: a Filtered and Multilingual Relation Extraction Dataset
- Authors: Pere-Llu\'is Huguet Cabot and Simone Tedeschi and Axel-Cyrille Ngonga
Ngomo and Roberto Navigli
- Abstract summary: We present two new resources that enable the training and evaluation of multilingual Relation Extraction systems.
First, we present SRED$rm FM$, an automatically annotated dataset covering 18 languages, 400 relation types, 13 entity types, totaling more than 40 million triplet instances.
Second, we propose RED$rm FM$, a smaller, human-revised dataset for seven languages that allows for the evaluation of multilingual RE systems.
- Score: 35.5973651237632
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Relation Extraction (RE) is a task that identifies relationships between
entities in a text, enabling the acquisition of relational facts and bridging
the gap between natural language and structured knowledge. However, current RE
models often rely on small datasets with low coverage of relation types,
particularly when working with languages other than English. In this paper, we
address the above issue and provide two new resources that enable the training
and evaluation of multilingual RE systems. First, we present SRED$^{\rm FM}$,
an automatically annotated dataset covering 18 languages, 400 relation types,
13 entity types, totaling more than 40 million triplet instances. Second, we
propose RED$^{\rm FM}$, a smaller, human-revised dataset for seven languages
that allows for the evaluation of multilingual RE systems. To demonstrate the
utility of these novel datasets, we experiment with the first end-to-end
multilingual RE model, mREBEL, that extracts triplets, including entity types,
in multiple languages. We release our resources and model checkpoints at
https://www.github.com/babelscape/rebel
Related papers
- Table Question Answering for Low-resourced Indic Languages [71.57359949962678]
TableQA is the task of answering questions over tables of structured information, returning individual cells or tables as output.
We introduce a fully automatic large-scale tableQA data generation process for low-resource languages with limited budget.
We incorporate our data generation method on two Indic languages, Bengali and Hindi, which have no tableQA datasets or models.
arXiv Detail & Related papers (2024-10-04T16:26:12Z) - How do Large Language Models Handle Multilingualism? [81.15060972112563]
This study explores how large language models (LLMs) handle multilingualism.
LLMs initially understand the query, converting multilingual inputs into English for task-solving.
In the intermediate layers, they employ English for thinking and incorporate multilingual knowledge with self-attention and feed-forward structures.
arXiv Detail & Related papers (2024-02-29T02:55:26Z) - The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z) - A Data Bootstrapping Recipe for Low Resource Multilingual Relation
Classification [38.83366564843953]
IndoRE is a dataset with 21K entity and relation tagged gold sentences in three Indian languages, plus English.
We start with a multilingual BERT (mBERT) based system that captures entity span positions and type information.
We study the accuracy efficiency tradeoff between expensive gold instances vs. translated and aligned'silver' instances.
arXiv Detail & Related papers (2021-10-18T18:40:46Z) - MFAQ: a Multilingual FAQ Dataset [9.625301186732598]
We present the first multilingual FAQ dataset publicly available.
We collected around 6M FAQ pairs from the web, in 21 different languages.
We adopt a similar setup as Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset.
arXiv Detail & Related papers (2021-09-27T08:43:25Z) - MTVR: Multilingual Moment Retrieval in Videos [89.24431389933703]
We introduce mTVR, a large-scale multilingual video moment retrieval dataset, containing 218K English and Chinese queries from 21.8K TV show video clips.
The dataset is collected by extending the popular TVR dataset (in English) with paired Chinese queries and subtitles.
We propose mXML, a multilingual moment retrieval model that learns and operates on data from both languages.
arXiv Detail & Related papers (2021-07-30T20:01:03Z) - DiS-ReX: A Multilingual Dataset for Distantly Supervised Relation
Extraction [15.649929244635269]
We propose a new dataset, DiS-ReX, which alleviates these issues.
Our dataset has more than 1.5 million sentences, spanning across 4 languages with 36 relation classes + 1 no relation (NA) class.
We also modify the widely used bag attention models by encoding sentences using mBERT and provide the first benchmark results on multilingual DS-RE.
arXiv Detail & Related papers (2021-04-17T22:44:38Z) - Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
arXiv Detail & Related papers (2021-03-23T13:25:55Z) - The RELX Dataset and Matching the Multilingual Blanks for Cross-Lingual
Relation Classification [0.0]
Current approaches for relation classification are mainly focused on the English language.
We propose two cross-lingual relation classification models: a baseline model based on Multilingual BERT and a new multilingual pretraining setup.
For evaluation, we introduce a new public benchmark dataset for cross-lingual relation classification in English, French, German, Spanish, and Turkish.
arXiv Detail & Related papers (2020-10-19T11:08:16Z) - GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and
Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words with long-range dependencies or are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
arXiv Detail & Related papers (2020-10-06T20:30:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.