ParsEL 1.0: Unsupervised Entity Linking in Persian Social Media Texts
- URL: http://arxiv.org/abs/2004.10816v1
- Date: Wed, 22 Apr 2020 19:34:13 GMT
- Title: ParsEL 1.0: Unsupervised Entity Linking in Persian Social Media Texts
- Authors: Majid Asgari-Bidhendi, Farzane Fakhrian and Behrouz Minaei-Bidgoli
- Abstract summary: A large portion of social media data is natural language text.
Recently, FarsBase, a Persian knowledge graph, has been introduced containing almost half a million entities.
In this paper, we propose an unsupervised Persian Entity Linking system.
- Score: 6.866104126509981
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, social media data has grown exponentially and can
now be counted among the largest data repositories in the world. A large
portion of this social media data is natural language text. However,
natural language is highly ambiguous because of the frequent occurrence of
entities whose names are polysemous words or phrases. Entity linking
is the task of linking the entity mentions in the text to their corresponding
entities in a knowledge base. Recently, FarsBase, a Persian knowledge graph,
has been introduced containing almost half a million entities. In this paper,
we propose an unsupervised Persian Entity Linking system, the first entity
linking system specifically focused on the Persian language, which utilizes
context-dependent and context-independent features. For this purpose, we also
publish the first entity linking corpus of the Persian language containing
67,595 words that have been crawled from social media texts of some popular
channels in the Telegram messenger. The proposed method achieves an F-score of
86.94% for the Persian language, which is comparable to similar
state-of-the-art methods for the English language.
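The abstract describes combining context-independent and context-dependent features to choose an entity for each mention, without giving the details. The following is only a minimal sketch of that general idea, not the authors' method: the toy knowledge base, the `prior` popularity value, the Jaccard overlap used as the context feature, and the mixing weight `alpha` are all assumptions made for illustration. The `f_score` helper assumes the reported F-score is the usual harmonic mean of precision and recall.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    entity_id: str          # hypothetical knowledge-base identifier
    prior: float            # context-independent feature: toy popularity in [0, 1]
    description: frozenset  # words describing the entity


# Toy knowledge base keyed by lowercased surface form.
# These entries are invented for illustration and are not taken from FarsBase.
KB = {
    "shiraz": [
        Entity("kb:Shiraz_city", 0.8, frozenset({"city", "iran", "fars", "province"})),
        Entity("kb:Shiraz_grape", 0.2, frozenset({"grape", "wine", "variety"})),
    ],
}


def context_score(context, entity):
    """Context-dependent feature: Jaccard overlap of context words and description."""
    context = set(context)
    union = context | entity.description
    return len(context & entity.description) / len(union) if union else 0.0


def link(mention, context, alpha=0.3):
    """Return the id of the best-scoring candidate, or None if there is no candidate."""
    candidates = KB.get(mention.lower(), [])
    if not candidates:
        return None
    best = max(
        candidates,
        key=lambda e: alpha * e.prior + (1 - alpha) * context_score(context, e),
    )
    return best.entity_id


def f_score(precision, recall):
    """Harmonic mean of precision and recall (assumed to be the reported metric)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0
```

In this sketch, `link("Shiraz", ["wine", "grape", "red"])` selects the grape candidate because the context overlap outweighs the lower prior, while `link("Shiraz", [])` falls back to the more popular city candidate. No labeled training data is used, which is the sense in which such a scorer is unsupervised.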
Related papers
- FarSSiBERT: A Novel Transformer-based Model for Semantic Similarity Measurement of Persian Social Networks Informal Texts [0.0]
This paper introduces a new transformer-based model to measure semantic similarity between Persian informal short texts from social networks.
It is pre-trained on approximately 104 million Persian informal short texts from social networks, making it one of a kind in the Persian language.
It has been demonstrated that our proposed model outperforms ParsBERT, LaBSE, and multilingual BERT on the Pearson and Spearman correlation coefficients.
arXiv Detail & Related papers (2024-07-27T05:04:49Z)
- PersianLLaMA: Towards Building First Persian Large Language Model [5.79461948374354]
This paper introduces the first large Persian language model, named PersianLLaMA, trained on a collection of Persian texts and datasets.
The results indicate that PersianLLaMA significantly outperforms its competitors in both understanding and generating Persian text.
arXiv Detail & Related papers (2023-12-25T12:48:55Z)
- Persian topic detection based on Human Word association and graph embedding [3.8137985834223507]
We propose a framework to detect topics in social media based on Human Word Association.
Most of the work done in this area is in English, but much has been done in the Persian language.
arXiv Detail & Related papers (2023-02-20T05:46:47Z)
- Knowledge-Grounded Conversational Data Augmentation with Generative Conversational Networks [76.11480953550013]
We take a step towards automatically generating conversational data using Generative Conversational Networks.
We evaluate our approach on conversations with and without knowledge on the Topical Chat dataset.
arXiv Detail & Related papers (2022-07-22T22:37:14Z)
- Linking Emergent and Natural Languages via Corpus Transfer [98.98724497178247]
We propose a novel way to establish a link by corpus transfer between emergent languages and natural languages.
Our approach showcases non-trivial transfer benefits for two different tasks -- language modeling and image captioning.
We also introduce a novel metric to predict the transferability of an emergent language by translating emergent messages to natural language captions grounded on the same images.
arXiv Detail & Related papers (2022-03-24T21:24:54Z)
- Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
arXiv Detail & Related papers (2021-03-23T13:25:55Z)
- The Challenges of Persian User-generated Textual Content: A Machine Learning-Based Approach [0.0]
This research applies machine learning-based approaches to tackle the hurdles that come with Persian user-generated textual content.
The presented approach uses machine-translated datasets to conduct sentiment analysis for the Persian language.
The results of the experiments show promising state-of-the-art performance compared with previous efforts.
arXiv Detail & Related papers (2021-01-20T11:57:59Z)
- Named Entity Recognition for Social Media Texts with Semantic Augmentation [70.44281443975554]
Existing approaches for named entity recognition suffer from data sparsity problems when conducted on short and informal texts.
We propose a neural-based approach to NER for social media texts where both local (from running text) and augmented semantics are taken into account.
arXiv Detail & Related papers (2020-10-29T10:06:46Z)
- Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)
- A novel approach to sentiment analysis in Persian using discourse and external semantic information [0.0]
Many approaches have been proposed to extract the sentiment of individuals from documents written in natural languages.
The majority of these approaches have focused on English, while resource-lean languages such as Persian suffer from a lack of research work and language resources.
To address this gap, the current work introduces new methods for sentiment analysis and applies them to Persian.
arXiv Detail & Related papers (2020-07-18T18:40:40Z)
- A Corpus for Large-Scale Phonetic Typology [112.19288631037055]
We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology.
The corpus has aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.
arXiv Detail & Related papers (2020-05-28T13:03:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.