Constructing Colloquial Dataset for Persian Sentiment Analysis of Social
Microblogs
- URL: http://arxiv.org/abs/2306.12679v2
- Date: Thu, 7 Mar 2024 04:25:50 GMT
- Title: Constructing Colloquial Dataset for Persian Sentiment Analysis of Social
Microblogs
- Authors: Mojtaba Mazoochi (ICT Research Institute, Tehran, Iran), Leila Rabiei
(Iran Telecommunication Research Center (ITRC), Tehran, Iran), Farzaneh
Rahmani (Computer Department, Mehralborz University, Tehran, Iran), Zeinab
Rajabi (Computer Department, Hazrat-e Masoumeh University, Qom, Iran)
- Abstract summary: This paper first constructs a user opinion dataset called ITRC-Opinion in a collaborative environment and in an insourced manner.
Our dataset contains 60,000 informal and colloquial Persian texts from social microblogs such as Twitter and Instagram.
Second, this study proposes a new architecture based on the convolutional neural network (CNN) model for more effective sentiment analysis of colloquial text in social microblog posts.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Introduction: Microblogging websites have amassed rich data sources for
sentiment analysis and opinion mining. In this regard, sentiment classification
has frequently proven inefficient because microblog posts typically lack
syntactically consistent terms and representatives, since users on these social
networks tend not to write lengthy statements. Low-resource languages also
face additional limitations. The Persian language has distinctive
characteristics and demands its own annotated data and models for the sentiment
analysis task, distinct from those built for English
text. Method: This paper first constructs a user opinion dataset called
ITRC-Opinion in a collaborative environment and in an insourced manner. Our dataset
contains 60,000 informal and colloquial Persian texts from social microblogs
such as Twitter and Instagram. Second, this study proposes a new architecture
based on the convolutional neural network (CNN) model for more effective
sentiment analysis of colloquial text in social microblog posts. The
constructed datasets are used to evaluate the presented architecture.
Furthermore, several models, such as LSTM, CNN-RNN, BiLSTM, and BiGRU, with
different word embeddings, including fastText, GloVe, and Word2vec, are
applied to our dataset and their results evaluated. Results: The results
demonstrate the benefit of our dataset and the proposed model (72% accuracy),
displaying meaningful improvement in sentiment classification performance.
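As a rough illustration of the kind of CNN-based text classifier the abstract refers to, the sketch below runs the forward pass of a generic convolution plus max-over-time pooling model in plain Python. It is not the architecture proposed in the paper, which the abstract does not describe. The class name `TinyTextCNN`, the three-class output, the filter widths, the random untrained weights, and the transliterated toy tokens are all assumptions made for this example.

```python
import math
import random


def conv1d_max_pool(embedded, filters, bias):
    """Slide each filter over the token sequence, apply ReLU, then max-pool over time."""
    pooled = []
    for filt, b in zip(filters, bias):
        width = len(filt)  # number of consecutive tokens this filter covers
        best = 0.0  # ReLU floor; also the value used when the text is shorter than the filter
        for start in range(len(embedded) - width + 1):
            window = embedded[start:start + width]
            act = b + sum(
                w * x
                for f_row, x_row in zip(filt, window)
                for w, x in zip(f_row, x_row)
            )
            best = max(best, act)
        pooled.append(best)
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyTextCNN:
    """Hypothetical, untrained CNN text classifier (forward pass only)."""

    def __init__(self, vocab, embed_dim=8, filter_widths=(2, 3),
                 filters_per_width=4, num_classes=3, seed=0):
        rng = random.Random(seed)

        def rand():
            return rng.uniform(-0.5, 0.5)

        self.vocab = vocab  # token -> id, ids start at 1; id 0 is reserved for unknown tokens
        self.embeddings = [[rand() for _ in range(embed_dim)]
                           for _ in range(len(vocab) + 1)]
        self.filters = [[[rand() for _ in range(embed_dim)] for _ in range(w)]
                        for w in filter_widths for _ in range(filters_per_width)]
        self.bias = [0.0] * len(self.filters)
        self.out_weights = [[rand() for _ in range(len(self.filters))]
                            for _ in range(num_classes)]

    def predict_proba(self, tokens):
        """Return one probability per class for a tokenized post."""
        ids = [self.vocab.get(t, 0) for t in tokens]
        embedded = [self.embeddings[i] for i in ids]
        features = conv1d_max_pool(embedded, self.filters, self.bias)
        logits = [sum(w * f for w, f in zip(row, features))
                  for row in self.out_weights]
        return softmax(logits)


# Toy usage with a made-up vocabulary of transliterated tokens.
vocab = {"kheili": 1, "khoob": 2, "bood": 3, "bad": 4}
model = TinyTextCNN(vocab)
probs = model.predict_proba(["kheili", "khoob", "bood"])
```

In practice the embedding table would be initialized from pretrained vectors such as the fastText, GloVe, or Word2vec embeddings mentioned above, and all weights would be trained on labeled data rather than left random.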
Related papers
- The Empirical Impact of Data Sanitization on Language Models [1.1359551336076306]
This paper empirically analyzes the effects of data sanitization across several benchmark language-modeling tasks.
Our results suggest that for some tasks such as sentiment analysis or entailment, the impact of redaction is quite low, typically around 1-5%.
For tasks such as comprehension Q&A there is a big drop of >25% in performance observed in redacted queries as compared to the original.
arXiv Detail & Related papers (2024-11-08T21:22:37Z)
- Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B.
We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively.
We also found that filtering dataset contents based on Not Safe For Work (NSFW) values calculated based on images alone does not exclude all the harmful content in alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z)
- Convolutional Neural Networks for Sentiment Analysis on Weibo Data: A Natural Language Processing Approach [0.228438857884398]
This study addresses the complex task of sentiment analysis on a dataset of 119,988 original tweets from Weibo using a Convolutional Neural Network (CNN).
A CNN-based model was utilized, leveraging word embeddings for feature extraction, and trained to perform sentiment classification.
The model achieved a macro-average F1-score of approximately 0.73 on the test set, showing balanced performance across positive, neutral, and negative sentiments.
arXiv Detail & Related papers (2023-07-13T03:02:56Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Detecting Offensive Language on Social Networks: An End-to-end Detection Method based on Graph Attention Networks [7.723697303436006]
We propose an end-to-end method based on community structure and text features for offensive language detection (CT-OLD).
We add user opinion to the community structure for representing user features. The user opinion is represented by user historical behavior information, which outperforms that represented by text information.
arXiv Detail & Related papers (2022-03-04T03:57:18Z)
- T-BERT -- Model for Sentiment Analysis of Micro-blogs Integrating Topic Model and BERT [0.0]
The effectiveness of BERT (Bidirectional Encoder Representations from Transformers) in sentiment classification tasks from a raw live dataset is demonstrated.
A novel T-BERT framework is proposed to show the enhanced performance obtainable by combining latent topics with contextual BERT embeddings.
arXiv Detail & Related papers (2021-06-02T12:01:47Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses how well existing language models distinguish the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.