The Challenges of Persian User-generated Textual Content: A Machine
Learning-Based Approach
- URL: http://arxiv.org/abs/2101.08087v1
- Date: Wed, 20 Jan 2021 11:57:59 GMT
- Title: The Challenges of Persian User-generated Textual Content: A Machine
Learning-Based Approach
- Authors: Mohammad Kasra Habib
- Abstract summary: This research applies machine learning-based approaches to tackle the hurdles that come with Persian user-generated textual content.
The presented approach uses machine-translated datasets to conduct sentiment analysis for the Persian language.
The experiments show promising state-of-the-art performance compared with previous efforts.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Over recent years a lot of research papers and studies have been published on
the development of effective approaches that benefit from a large amount of
user-generated content and build intelligent predictive models on top of them.
This research applies machine learning-based approaches to tackle the hurdles
that come with Persian user-generated textual content. Unfortunately, there is
still little research on applying machine learning approaches to
classify or cluster Persian text. Further, analysis of Persian text suffers
from a lack of resources, specifically datasets and text-manipulation tools.
Since the syntax and semantics of the Persian language differ from those of
English and other languages, the resources available for those languages are
not directly usable for Persian. In addition, recognition of nouns and
pronouns, part-of-speech tagging, word-boundary detection, stemming, and
character manipulation for Persian remain open problems that require further
study. Therefore, this research makes an effort to address some of these
challenges. The presented approach uses machine-translated datasets to conduct
sentiment analysis for the Persian language. Finally, the dataset was evaluated
with different classifiers and feature-engineering approaches. The experimental
results show promising state-of-the-art performance compared with previous
efforts; the best classifier was a Support Vector Machine, which achieved a
precision of 91.22%, a recall of 91.71%, and an F1 score of 91.46%.
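As a quick sanity check (not part of the paper itself), the reported F1 score is consistent with the standard definition of F1 as the harmonic mean of precision and recall:

```python
# Verify that the reported F1 is the harmonic mean of the reported
# precision and recall (values taken from the abstract above).
precision = 0.9122
recall = 0.9171

# F1 = 2PR / (P + R)
f1 = 2 * precision * recall / (precision + recall)
print(round(f1 * 100, 2))  # 91.46, matching the reported F1 score
```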
Related papers
- Enhancing Aspect-based Sentiment Analysis with ParsBERT in Persian Language [0.0]
This paper aims to improve the efficiency of language models tailored to the Persian language.
The study centers on sentiment analysis of user opinions extracted from the Persian website 'Digikala'.
arXiv Detail & Related papers (2025-02-03T06:25:06Z) - Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z) - PersianLLaMA: Towards Building First Persian Large Language Model [5.79461948374354]
This paper introduces the first large Persian language model, named PersianLLaMA, trained on a collection of Persian texts and datasets.
The results indicate that PersianLLaMA significantly outperforms its competitors in both understanding and generating Persian text.
arXiv Detail & Related papers (2023-12-25T12:48:55Z) - No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z) - Evaluating Persian Tokenizers [6.10917825357379]
This article introduces a novel evaluation of the most widely used tokenizers for Persian.
It compares and evaluates their performance on Persian texts using a simple algorithm and a pre-tagged Persian dependency dataset.
After evaluating the tokenizers with the F1 score, a hybrid of the Farsi Verb tokenizer and Hazm with bound-morpheme fixing showed the best performance, with an F1 score of 98.97%.
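The cited paper's exact scoring algorithm is not given here, but a common way to score a tokenizer with token-level F1 is to compare predicted token spans against a gold segmentation. The sketch below is a hypothetical, simplified illustration (placeholder Latin tokens stand in for Persian ones, and spans are computed over the concatenated token stream):

```python
def token_spans(tokens):
    """Map a token list to a set of (start, end) character offsets
    over the concatenated token stream (whitespace ignored for simplicity)."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans

def token_f1(predicted, gold):
    """Token-level F1: a predicted token counts as correct only if its
    span exactly matches a gold token's span."""
    pred, ref = token_spans(predicted), token_spans(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: the tokenizer wrongly merged two gold tokens into one.
gold = ["in", "ketab", "ra", "kharid"]
pred = ["in", "ketabra", "kharid"]
print(round(token_f1(pred, gold), 2))  # 0.57
```

Boundary-based scoring like this penalizes both over-segmentation and under-segmentation, which is why it is a natural fit for comparing Persian tokenizers that disagree on multi-word verbs and bound morphemes.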
arXiv Detail & Related papers (2022-02-22T13:27:24Z) - ViraPart: A Text Refinement Framework for ASR and NLP Tasks in Persian [0.0]
We propose a ViraPart framework that uses embedded ParsBERT in its core for text clarifications.
In the end, the proposed model for ZWNJ recognition, punctuation restoration, and Persian Ezafe construction performs the averaged F1 macro scores of 96.90%, 92.13%, and 98.50%, respectively.
arXiv Detail & Related papers (2021-10-18T08:20:40Z) - On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z) - Improving Cross-Lingual Reading Comprehension with Self-Training [62.73937175625953]
Current state-of-the-art models even surpass human performance on several benchmarks.
Previous works have revealed the abilities of pre-trained multilingual models for zero-shot cross-lingual reading comprehension.
This paper further utilizes unlabeled data to improve performance.
arXiv Detail & Related papers (2021-05-08T08:04:30Z) - Curious Case of Language Generation Evaluation Metrics: A Cautionary
Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z) - Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z) - A novel approach to sentiment analysis in Persian using discourse and
external semantic information [0.0]
Many approaches have been proposed to extract the sentiment of individuals from documents written in natural languages.
The majority of these approaches have focused on English, while resource-lean languages such as Persian suffer from the lack of research work and language resources.
Due to this gap in Persian, the current work is accomplished to introduce new methods for sentiment analysis which have been applied on Persian.
arXiv Detail & Related papers (2020-07-18T18:40:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.