Improving Persian Relation Extraction Models by Data Augmentation
- URL: http://arxiv.org/abs/2203.15323v1
- Date: Tue, 29 Mar 2022 08:08:47 GMT
- Title: Improving Persian Relation Extraction Models by Data Augmentation
- Authors: Moein Salimi Sartakhti, Romina Etezadi, Mehrnoush Shamsfard
- Abstract summary: We present our augmented dataset and the results and findings of our system.
We use PERLEX as the base dataset and enhance it with text preprocessing and data augmentation.
We then employ two models, ParsBERT and multilingual BERT, for relation extraction on the augmented PERLEX dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Relation extraction, the task of predicting the semantic relation type
between entities in a sentence or document, is an important task in natural
language processing. Although there are many studies and datasets for English,
Persian lacks sufficient research and comprehensive datasets. The only
available Persian dataset for this task is PERLEX, a Persian expert-translated
version of the SemEval-2010 Task 8 dataset. In this paper, we present our
augmented dataset and the results and findings of our system, which
participated in the Persian Relation Extraction shared task of the NSURL 2021
workshop. We use PERLEX as the base dataset and enhance it by applying text
preprocessing steps and by increasing its size via data augmentation
techniques, improving the generalization and robustness of the applied models.
We then employ two models, ParsBERT and multilingual BERT, for relation
extraction on the augmented PERLEX dataset. Our best model obtained a Macro-F1
of 64.67% in the test phase of the contest and a Macro-F1 of 83.68% on the
PERLEX test set.
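The abstract names the models but not the implementation details, so below is a minimal sketch of the usual recipe, assuming SemEval-style entity markers and a standard sequence-classification head. The checkpoint name, marker tokens, label count, and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (not the authors' exact pipeline): fine-tune ParsBERT as a
# relation classifier over sentences carrying <e1>...</e1> / <e2>...</e2>
# entity markers, the format used by SemEval-2010 Task 8, which PERLEX mirrors.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "HooshvareLab/bert-base-parsbert-uncased"  # assumed checkpoint
NUM_RELATIONS = 19  # SemEval-2010 Task 8: 9 directed relations x 2 + "Other"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Register the entity markers so the tokenizer keeps them as single tokens.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<e1>", "</e1>", "<e2>", "</e2>"]}
)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_RELATIONS
)
model.resize_token_embeddings(len(tokenizer))  # account for the new tokens

def encode(sentences, labels):
    """Tokenize marked sentences and attach integer relation labels."""
    enc = tokenizer(sentences, padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

# Toy batch standing in for the (augmented) PERLEX training set.
sentences = ["<e1>...</e1> ... <e2>...</e2>"]  # marked Persian sentence
labels = [0]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
batch = encode(sentences, labels)
loss = model(**batch).loss  # cross-entropy over the relation labels
loss.backward()
optimizer.step()
```

Swapping MODEL_NAME for bert-base-multilingual-cased gives the multilingual BERT variant; an augmentation step would simply expand the training pairs before this loop (the abstract does not say which augmentation techniques the authors used).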
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- HYBRINFOX at CheckThat! 2024 -- Task 1: Enhancing Language Models with Structured Information for Check-Worthiness Estimation [0.8083061106940517]
This paper summarizes the experiments and results of the HYBRINFOX team for the CheckThat! 2024 - Task 1 competition.
We propose an approach enriching Language Models such as RoBERTa with embeddings produced from triples.
arXiv Detail & Related papers (2024-07-04T11:33:54Z)
- Information Extraction: An application to the domain of hyper-local financial data on developing countries [0.0]
We develop and evaluate two Natural Language Processing (NLP) based techniques to address this issue.
First, we curate a custom dataset specific to the domain of financial text data on developing countries.
We then explore a text-to-text approach with the transformer-based T5 model, aiming to perform NER and relation extraction simultaneously.
arXiv Detail & Related papers (2024-03-14T03:49:36Z)
- NNOSE: Nearest Neighbor Occupational Skill Extraction [55.22292957778972]
We tackle the complexity in occupational skill datasets.
We employ an external datastore for retrieving similar skills in a dataset-unifying manner (see the sketch after this list).
We observe a performance gain in predicting infrequent patterns, with substantial gains of up to 30% span-F1 in cross-dataset settings.
arXiv Detail & Related papers (2024-01-30T15:18:29Z)
- An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
- A Simple and Efficient Ensemble Classifier Combining Multiple Neural Network Models on Social Media Datasets in Vietnamese [2.7528170226206443]
This study aims to classify Vietnamese texts on social media from three different Vietnamese benchmark datasets.
Advanced deep learning models are used and optimized in this study, including CNN, LSTM, and their variants.
Our ensemble model achieves the best performance on all three datasets.
arXiv Detail & Related papers (2020-09-28T04:28:48Z)
- PERLEX: A Bilingual Persian-English Gold Dataset for Relation Extraction [6.10917825357379]
"PERLEX" is the first dataset for relation extraction in the Persian language.
We employ six different models for relation extraction on the proposed bilingual dataset.
Experiments yield a maximum F-score of 77.66%, the state of the art for relation extraction in the Persian language.
arXiv Detail & Related papers (2020-05-13T21:06:59Z)
- XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation [100.09099800591822]
XGLUE is a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models.
XGLUE provides 11 diversified tasks that cover both natural language understanding and generation scenarios.
arXiv Detail & Related papers (2020-04-03T07:03:12Z)
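As a rough illustration of the nearest-neighbor idea behind NNOSE (referenced above), the sketch below embeds skill spans, stores them in an external datastore, and retrieves the closest entries for a new span; the toy encoder and cosine metric are assumptions for illustration, not the paper's exact setup.

```python
# Illustrative sketch (assumed details, not NNOSE itself): retrieve similar
# skill spans from an external datastore pooled across datasets.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def embed(spans, dim=16):
    """Deterministic toy encoder; NNOSE would use a fine-tuned language model."""
    vecs = []
    for s in spans:
        rng = np.random.default_rng(abs(hash(s)) % (2**32))
        vecs.append(rng.standard_normal(dim))
    return np.stack(vecs)

# External datastore: skill spans pooled from several skill datasets.
datastore_spans = ["python programming", "stakeholder management",
                   "data analysis", "project planning"]
index = NearestNeighbors(n_neighbors=2, metric="cosine")
index.fit(embed(datastore_spans))

# At inference, look up neighbors for a new span; NNOSE combines this
# retrieved evidence with the base tagger's prediction (omitted here).
distances, ids = index.kneighbors(embed(["python development"]))
for d, i in zip(distances[0], ids[0]):
    print(f"neighbor: {datastore_spans[i]!r} (cosine distance {d:.3f})")
```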