DiS-ReX: A Multilingual Dataset for Distantly Supervised Relation Extraction
- URL: http://arxiv.org/abs/2104.08655v1
- Date: Sat, 17 Apr 2021 22:44:38 GMT
- Title: DiS-ReX: A Multilingual Dataset for Distantly Supervised Relation Extraction
- Authors: Abhyuday Bhartiya, Kartikeya Badola, Mausam
- Abstract summary: We propose a new dataset, DiS-ReX, which alleviates these issues.
Our dataset has more than 1.5 million sentences spanning 4 languages, with 36 relation classes + 1 no relation (NA) class.
We also modify the widely used bag attention models by encoding sentences using mBERT and provide the first benchmark results on multilingual DS-RE.
- Score: 15.649929244635269
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distant supervision (DS) is a well-established technique for creating
large-scale datasets for relation extraction (RE) without using human
annotations. However, research in DS-RE has been mostly limited to the English
language. Constraining RE to a single language inhibits the utilization of large
amounts of data in other languages that could enable the extraction of more diverse
facts. Very recently, a dataset for multilingual DS-RE has been released.
However, our analysis reveals that the proposed dataset exhibits unrealistic
characteristics such as 1) lack of sentences that do not express any relation,
and 2) all sentences for a given entity pair expressing exactly one relation.
We show that these characteristics lead to a gross overestimation of model
performance. In response, we propose a new dataset, DiS-ReX, which alleviates
these issues. Our dataset has more than 1.5 million sentences spanning
4 languages, with 36 relation classes + 1 no relation (NA) class. We also modify
the widely used bag attention models by encoding sentences using mBERT and
provide the first benchmark results on multilingual DS-RE. Unlike the competing
dataset, ours is challenging and leaves ample room for future research in this
field.
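To make the modelling step concrete, below is a minimal PyTorch sketch of a bag-attention model whose sentence encoder is mBERT, following the description in the abstract: sentences sharing an entity pair are grouped into a bag, attended over per relation, and scored against the 36 + 1 (NA) label set. The class name MBertBagAttention, the [CLS] pooling choice, and all hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (assumed, not the authors' code): mBERT sentence encoding
# plus bag-level selective attention over 36 relation classes + 1 NA class.
import torch
import torch.nn as nn
from transformers import AutoModel

NUM_CLASSES = 36 + 1  # 36 relations + 1 "no relation" (NA) class


class MBertBagAttention(nn.Module):
    def __init__(self, model_name="bert-base-multilingual-cased", num_classes=NUM_CLASSES):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # One learned query vector per relation for selective attention over a bag.
        self.relation_queries = nn.Parameter(torch.randn(num_classes, hidden))
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, input_ids, attention_mask, bag_ids):
        # Encode every sentence; take the [CLS] vector as its representation
        # (an assumed pooling choice).
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        sent_reps = hidden_states[:, 0]                  # (num_sentences, hidden)

        bag_logits = []
        for bag in bag_ids.unique():
            reps = sent_reps[bag_ids == bag]             # sentences sharing one entity pair
            scores = reps @ self.relation_queries.t()    # (n_sents, num_classes)
            alpha = torch.softmax(scores, dim=0)         # attention over sentences, per relation
            bag_reps = alpha.t() @ reps                  # (num_classes, hidden)
            # Score each relation from its own attention-weighted bag representation.
            logits = self.classifier(bag_reps).diagonal()
            bag_logits.append(logits)
        return torch.stack(bag_logits)                   # (num_bags, num_classes)
```

At training time, bag-attention models typically apply the attention weights of the gold relation; the diagonal readout above is the usual inference-time scoring.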
Related papers
- MASSIVE Multilingual Abstract Meaning Representation: A Dataset and Baselines for Hallucination Detection [3.6811136816751513]
We introduce MASSIVE-AMR, a dataset with more than 84,000 text-to-graph annotations: AMR graphs for 1,685 information-seeking utterances mapped to 50+ typologically diverse languages.
Results shed light on persistent issues using LLMs for structured parsing.
arXiv Detail & Related papers (2024-05-29T17:17:22Z)
- Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing [68.47787275021567]
Cross-lingual semantic parsing transfers parsing capability from a high-resource language (e.g., English) to low-resource languages with scarce training data.
We propose a new approach to cross-lingual semantic parsing by explicitly minimizing cross-lingual divergence between latent variables using Optimal Transport.
arXiv Detail & Related papers (2023-07-09T04:52:31Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- MFAQ: a Multilingual FAQ Dataset [9.625301186732598]
We present the first publicly available multilingual FAQ dataset.
We collected around 6M FAQ pairs from the web, in 21 different languages.
We adopt a similar setup as Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset.
arXiv Detail & Related papers (2021-09-27T08:43:25Z)
- Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering [8.558954185502012]
We propose a method to improve the Cross-lingual Question Answering performance without requiring additional annotated data.
We report a new state-of-the-art on four multilingual datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr)
arXiv Detail & Related papers (2020-10-23T20:09:01Z)
- The RELX Dataset and Matching the Multilingual Blanks for Cross-Lingual Relation Classification [0.0]
Current approaches for relation classification are mainly focused on the English language.
We propose two cross-lingual relation classification models: a baseline model based on Multilingual BERT and a new multilingual pretraining setup.
For evaluation, we introduce a new public benchmark dataset for cross-lingual relation classification in English, French, German, Spanish, and Turkish.
arXiv Detail & Related papers (2020-10-19T11:08:16Z)
- GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words with long-range dependencies or words that are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
arXiv Detail & Related papers (2020-10-06T20:30:35Z)
- MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark [31.91964553419665]
We present a new multilingual dataset, called MTOP, comprising 100k annotated utterances in 6 languages across 11 domains.
We achieve an average improvement of +6.3 points on Slot F1 for the two existing multilingual datasets, over best results reported in their experiments.
We demonstrate strong zero-shot performance using pre-trained models combined with automatic translation and alignment, and a proposed distant supervision method to reduce the noise in slot label projection.
arXiv Detail & Related papers (2020-08-21T07:02:11Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)