The RELX Dataset and Matching the Multilingual Blanks for Cross-Lingual
Relation Classification
- URL: http://arxiv.org/abs/2010.09381v1
- Date: Mon, 19 Oct 2020 11:08:16 GMT
- Title: The RELX Dataset and Matching the Multilingual Blanks for Cross-Lingual
Relation Classification
- Authors: Abdullatif Köksal, Arzucan Özgür
- Abstract summary: Current approaches for relation classification are mainly focused on the English language.
We propose two cross-lingual relation classification models: a baseline model based on Multilingual BERT and a new multilingual pretraining setup.
For evaluation, we introduce a new public benchmark dataset for cross-lingual relation classification in English, French, German, Spanish, and Turkish.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Relation classification is one of the key topics in information extraction,
which can be used to construct knowledge bases or to provide useful information
for question answering. Current approaches for relation classification are
mainly focused on the English language and require lots of training data with
human annotations. Creating and annotating a large amount of training data for
low-resource languages is impractical and expensive. To overcome this issue, we
propose two cross-lingual relation classification models: a baseline model
based on Multilingual BERT and a new multilingual pretraining setup, which
significantly improves the baseline with distant supervision. For evaluation,
we introduce a new public benchmark dataset for cross-lingual relation
classification in English, French, German, Spanish, and Turkish, called RELX.
We also provide the RELX-Distant dataset, which includes hundreds of thousands
of sentences with relations from Wikipedia and Wikidata collected by distant
supervision for these languages. Our code and data are available at:
https://github.com/boun-tabi/RELX
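The distant supervision recipe behind RELX-Distant can be sketched as follows: a sentence mentioning an entity pair that is linked by a known Wikidata relation is labeled with that relation, with no human annotation. The knowledge base, sentences, and function below are toy illustrations, not the actual RELX pipeline.

```python
# Hypothetical sketch of distant supervision for relation classification.
# Toy "Wikidata-style" knowledge base: (subject, object) -> relation.
KB = {
    ("Marie Curie", "Warsaw"): "place_of_birth",
    ("Ankara", "Turkey"): "capital_of",
}

def distant_label(sentence, entity_pair):
    """Return a relation label if both entities occur in the sentence
    and the pair is linked by a relation in the knowledge base."""
    subj, obj = entity_pair
    if subj in sentence and obj in sentence:
        return KB.get((subj, obj))
    return None

# Candidate sentences paired with the entity mentions found in them.
sentences = [
    ("Marie Curie was born in Warsaw in 1867.", ("Marie Curie", "Warsaw")),
    ("Ankara is the capital of Turkey.", ("Ankara", "Turkey")),
    ("Marie Curie moved to Paris.", ("Marie Curie", "Warsaw")),
]

# Sentences whose label is None are discarded from the training corpus.
corpus = [(s, distant_label(s, pair)) for s, pair in sentences]
```

Running this over Wikipedia sentences in each target language, against the same language-independent Wikidata triples, is what makes the resulting corpus multilingual; note that distant labels are noisy (the third sentence above correctly gets no label only because one entity is absent).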
Related papers
- $\mu$PLAN: Summarizing using a Content Plan as Cross-Lingual Bridge [72.64847925450368]
Cross-lingual summarization consists of generating a summary in one language given an input document in a different language.
This work presents $\mu$PLAN, an approach to cross-lingual summarization that uses an intermediate planning step as a cross-lingual bridge.
arXiv Detail & Related papers (2023-05-23T16:25:21Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z)
- CONCRETE: Improving Cross-lingual Fact-checking with Cross-lingual Retrieval [73.48591773882052]
Most fact-checking approaches focus on English only due to the data scarcity issue in other languages.
We present the first fact-checking framework augmented with cross-lingual retrieval.
We train the retriever with our proposed Crosslingual Inverse Cloze Task (XICT).
arXiv Detail & Related papers (2022-09-05T17:36:14Z)
- CrossSum: Beyond English-Centric Cross-Lingual Summarization for 1,500+ Language Pairs [27.574815708395203]
CrossSum is a large-scale cross-lingual summarization dataset comprising 1.68 million article-summary samples in 1,500+ language pairs.
We create CrossSum by aligning parallel articles written in different languages via cross-lingual retrieval from a multilingual abstractive summarization dataset.
We propose a multistage data sampling algorithm to effectively train a cross-lingual summarization model capable of summarizing an article in any target language.
arXiv Detail & Related papers (2021-12-16T11:40:36Z)
- A Data Bootstrapping Recipe for Low Resource Multilingual Relation Classification [38.83366564843953]
IndoRE is a dataset with 21K entity and relation tagged gold sentences in three Indian languages, plus English.
We start with a multilingual BERT (mBERT) based system that captures entity span positions and type information.
We study the accuracy-efficiency tradeoff between expensive gold instances and translated and aligned 'silver' instances.
arXiv Detail & Related papers (2021-10-18T18:40:46Z)
- MultiEURLEX -- A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer [13.24356999779404]
We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents.
The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy.
We use the dataset as a testbed for zero-shot cross-lingual transfer, where we exploit annotated training documents in one language (source) to classify documents in another language (target).
arXiv Detail & Related papers (2021-09-02T12:52:55Z)
- Multilingual Compositional Wikidata Questions [9.602430657819564]
We propose a method for creating a multilingual, parallel dataset of question-query pairs grounded in Wikidata.
We use this data to train semantic parsers for Hebrew, Kannada, Chinese, and English to better understand the current strengths and weaknesses of multilingual semantic parsing.
arXiv Detail & Related papers (2021-08-07T19:40:38Z)
- Improving Low-resource Reading Comprehension via Cross-lingual Transposition Rethinking [0.9236074230806579]
Extractive Reading Comprehension (ERC) has made tremendous advances, enabled by the availability of large-scale, high-quality ERC training data.
Despite such rapid progress and widespread application, datasets in languages other than high-resource languages such as English remain scarce.
We propose a Cross-Lingual Transposition ReThinking (XLTT) model by modelling existing high-quality extractive reading comprehension datasets in a multilingual environment.
arXiv Detail & Related papers (2021-07-11T09:35:16Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation [100.09099800591822]
XGLUE is a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models.
XGLUE provides 11 diversified tasks that cover both natural language understanding and generation scenarios.
arXiv Detail & Related papers (2020-04-03T07:03:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.