PERLEX: A Bilingual Persian-English Gold Dataset for Relation Extraction
- URL: http://arxiv.org/abs/2005.06588v1
- Date: Wed, 13 May 2020 21:06:59 GMT
- Title: PERLEX: A Bilingual Persian-English Gold Dataset for Relation Extraction
- Authors: Majid Asgari-Bidhendi, Mehrdad Nasser, Behrooz Janfada, Behrouz
Minaei-Bidgoli
- Abstract summary: "PERLEX" is the first dataset for relation extraction in the Persian language.
We employ six different models for relation extraction on the proposed bilingual dataset.
Experiments yield a maximum F1-score of 77.66%, the state of the art for relation extraction in Persian.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Relation extraction is the task of extracting semantic relations between
entities in a sentence. It is an essential part of some natural language
processing tasks such as information extraction, knowledge extraction, and
knowledge base population. The main motivations of this research are the lack
of a relation extraction dataset for Persian and the need to extract knowledge
from the growing volume of Persian big data for various applications. In this
paper, we present "PERLEX" as the
first Persian dataset for relation extraction, which is an expert-translated
version of the "Semeval-2010-Task-8" dataset. Moreover, this paper addresses
Persian relation extraction utilizing state-of-the-art language-agnostic
algorithms. We employ six different models for relation extraction on the
proposed bilingual dataset, including a non-neural model (as the baseline),
three neural models, and two deep learning models fed by multilingual-BERT
contextual word representations. The experiments yield a maximum F1-score of
77.66% (obtained by the BERTEM-MTB method), which constitutes the state of the
art for relation extraction in Persian.
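For context on the winning method: BERTEM-style models mark each entity with special tokens and classify the relation from the markers' contextual representations. Below is a minimal sketch of that entity-marker scheme on top of multilingual BERT, assuming the HuggingFace transformers library; the marker tokens, pooling choice, and classifier head are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of BERTEM-style relation classification with entity markers and
# multilingual BERT. Assumptions: HuggingFace transformers is available;
# the [E1]/[E2] marker tokens and the linear head are illustrative.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

MARKERS = ["[E1]", "[/E1]", "[E2]", "[/E2]"]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
tokenizer.add_special_tokens({"additional_special_tokens": MARKERS})

encoder = BertModel.from_pretrained("bert-base-multilingual-cased")
encoder.resize_token_embeddings(len(tokenizer))  # make room for the markers

num_relations = 19  # SemEval-2010 Task 8: 9 directed relations + "Other"
classifier = nn.Linear(2 * encoder.config.hidden_size, num_relations)

def relation_logits(marked_sentence: str) -> torch.Tensor:
    """Encode a sentence whose entities are wrapped in markers and score
    every relation label from the two entity-start representations."""
    enc = tokenizer(marked_sentence, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state[0]  # (seq_len, hidden)
    ids = enc["input_ids"][0].tolist()
    e1 = ids.index(tokenizer.convert_tokens_to_ids("[E1]"))
    e2 = ids.index(tokenizer.convert_tokens_to_ids("[E2]"))
    pair = torch.cat([hidden[e1], hidden[e2]])    # entity-start pooling
    return classifier(pair)

logits = relation_logits("The [E1]fire[/E1] was caused by a faulty [E2]heater[/E2].")
print(logits.argmax().item())  # index of the highest-scoring relation
```

Entity-start pooling (concatenating the hidden states at [E1] and [E2]) is the variant reported to work best in the Matching-the-Blanks line of work that BERTEM-MTB builds on.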
Related papers
- MixRED: A Mix-lingual Relation Extraction Dataset (arXiv, 2024-03-23)
We introduce MixRE, a novel task that considers relation extraction in a mix-lingual scenario.
In addition to constructing the MixRED dataset, we evaluate both state-of-the-art supervised models and large language models (LLMs) on MixRED.
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages (arXiv, 2023-07-16)
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
- Ensemble Transfer Learning for Multilingual Coreference Resolution (arXiv, 2023-01-22)
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
- ImPaKT: A Dataset for Open-Schema Knowledge Base Construction (arXiv, 2022-12-21)
ImPaKT is a dataset for open-schema information extraction in the shopping domain (product buying guides), consisting of around 2,500 text snippets from the C4 corpus.
We evaluate the power of this approach by fine-tuning the open source UL2 language model on a subset of the dataset, extracting a set of implication relations from a corpus of product buying guides, and conducting human evaluations of the resulting predictions.
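Since the summary mentions fine-tuning the open-source UL2 model, a hedged sketch of framing implication extraction as text-to-text generation follows; the "google/ul2" checkpoint ID refers to the public HuggingFace release, while the prompt format and example snippet are assumptions for illustration, not the dataset authors' setup.

```python
# Illustrative sketch: implication extraction as text-to-text generation
# with UL2. Assumptions: HuggingFace transformers; "google/ul2" is the
# public checkpoint; the prompt wording is hypothetical.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/ul2")
model = AutoModelForSeq2SeqLM.from_pretrained("google/ul2")

snippet = ("A down sleeping bag packs smaller and keeps you warmer in dry "
           "cold than a synthetic one.")
# "[S2S]" selects UL2's sequence-to-sequence denoising mode.
prompt = f"[S2S] List the implication relations in: {snippet} <extra_id_0>"

inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```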
- Towards Relation Extraction From Speech (arXiv, 2022-10-17)
We propose a new listening information extraction task, i.e., speech relation extraction.
We construct the training dataset for speech relation extraction via text-to-speech systems, and we construct the testing dataset via crowd-sourcing with native English speakers.
We conduct comprehensive experiments to distinguish the challenges in speech relation extraction, which may shed light on future explorations.
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization (arXiv, 2022-04-28)
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
- Improving Persian Relation Extraction Models by Data Augmentation (arXiv, 2022-03-29)
We present our augmented dataset and the results and findings of our system.
We use PERLEX as the base dataset and enhance it by applying some text preprocessing steps.
We then employ two models, ParsBERT and multilingual BERT, for relation extraction on the augmented PERLEX dataset.
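If the entity-marker sketch shown after the main abstract is reused, swapping multilingual BERT for ParsBERT is a two-line change; the checkpoint ID below is the commonly used HooshvareLab release and is an assumption about which ParsBERT variant the authors mean.

```python
# Same entity-marker pipeline as above, with ParsBERT as the encoder
# (assumption: the HooshvareLab checkpoint is the intended ParsBERT).
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
encoder = BertModel.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
```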
- Improving Sentence-Level Relation Extraction through Curriculum Learning (arXiv, 2021-07-20)
We propose a curriculum learning-based relation extraction model that splits the data by difficulty and uses the resulting stages for training.
In experiments on the representative sentence-level relation extraction datasets TACRED and Re-TACRED, the proposed method showed good performance.
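To make the split-by-difficulty idea concrete, here is a small sketch that stages training data from easy to hard; the sentence-length heuristic is a hypothetical stand-in for the paper's actual difficulty criterion.

```python
# Sketch of curriculum learning: rank examples by a difficulty score and
# yield cumulative, easiest-first training stages. The length heuristic
# in the demo is a hypothetical stand-in for the paper's criterion.
from typing import Callable, Iterable, List, Tuple

Example = Tuple[str, str]  # (sentence with entity markers, relation label)

def curriculum_stages(data: List[Example],
                      difficulty: Callable[[Example], float],
                      n_stages: int = 3) -> Iterable[List[Example]]:
    """Yield cumulative training sets, easiest examples first."""
    ranked = sorted(data, key=difficulty)
    step = max(1, len(ranked) // n_stages)
    for s in range(1, n_stages + 1):
        yield ranked[: len(ranked) if s == n_stages else s * step]

data = [
    ("The [E1]fire[/E1] came from the [E2]heater[/E2].", "Cause-Effect(e2,e1)"),
    ("A [E1]wheel[/E1] is one component of a modern [E2]car[/E2].",
     "Component-Whole(e1,e2)"),
]

for stage, subset in enumerate(
        curriculum_stages(data, lambda ex: len(ex[0].split()), n_stages=2), 1):
    print(f"stage {stage}: train on {len(subset)} examples")
```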
- ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning (arXiv, 2020-12-30)
We propose a novel contrastive learning framework named ERICA in pre-training phase to obtain a deeper understanding of the entities and their relations in text.
Experimental results demonstrate that our proposed ERICA framework achieves consistent improvements on several document-level language understanding tasks.
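The summary does not spell out ERICA's objective; for orientation, a generic InfoNCE-style contrastive loss of the kind widely used for entity/relation pre-training is sketched below. This is an illustrative loss, not ERICA's exact formulation.

```python
# Generic InfoNCE contrastive loss: pull an anchor representation toward
# its positive and away from negatives in cosine-similarity space.
# Illustrative only; not ERICA's exact objective.
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             negatives: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """anchor, positive: (d,); negatives: (n, d)."""
    a = F.normalize(anchor, dim=-1)
    pos = F.normalize(positive, dim=-1)
    neg = F.normalize(negatives, dim=-1)
    sims = torch.cat([(a * pos).sum(-1, keepdim=True), neg @ a]) / temperature
    # The positive sits at index 0, so the "class" to predict is 0.
    return F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))

loss = info_nce(torch.randn(128), torch.randn(128), torch.randn(8, 128))
print(loss.item())
```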
- Mixed-Lingual Pre-training for Cross-lingual Summarization (arXiv, 2020-10-18)
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model improves ROUGE-1 by 2.82 points (English to Chinese) and 1.15 points (Chinese to English) over state-of-the-art results.