FarSSiBERT: A Novel Transformer-based Model for Semantic Similarity Measurement of Persian Social Networks Informal Texts
- URL: http://arxiv.org/abs/2407.19173v1
- Date: Sat, 27 Jul 2024 05:04:49 GMT
- Title: FarSSiBERT: A Novel Transformer-based Model for Semantic Similarity Measurement of Persian Social Networks Informal Texts
- Authors: Seyed Mojtaba Sadjadi, Zeinab Rajabi, Leila Rabiei, Mohammad-Shahram Moin
- Abstract summary: This paper introduces a new transformer-based model to measure semantic similarity between Persian informal short texts from social networks.
It is pre-trained on approximately 104 million Persian informal short texts from social networks, making it one of a kind in the Persian language.
It has been demonstrated that our proposed model outperforms ParsBERT, LaBSE, and multilingual BERT on the Pearson and Spearman correlation criteria.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: One fundamental task in NLP is to determine the similarity between two texts and evaluate the extent of their likeness. Previous methods for the Persian language have low accuracy and are unable to capture the structure and meaning of texts effectively. Additionally, these methods primarily focus on formal texts, but real-world text processing applications need robust methods that can handle colloquial texts. This requires algorithms that consider the structure and significance of words based on context, rather than just word frequency. The lack of a proper dataset for this task in Persian makes it important to develop such algorithms and to construct a dataset of Persian text. This paper introduces a new transformer-based model to measure semantic similarity between Persian informal short texts from social networks. In addition, a Persian dataset named FarSSiM has been constructed for this purpose, built from real social network data and manually annotated and verified by a team of linguistic experts. The proposed approach involves training a large language model with the BERT architecture from scratch. This model, called FarSSiBERT, is pre-trained on approximately 104 million Persian informal short texts from social networks, making it one of a kind for the Persian language. Moreover, a novel tokenizer specialized for informal language is provided that not only tokenizes formal texts well but also accurately identifies tokens that other Persian tokenizers fail to recognize. It has been demonstrated that the proposed model outperforms ParsBERT, LaBSE, and multilingual BERT on the Pearson and Spearman correlation criteria. Additionally, the pre-trained large language model has great potential for use in other NLP tasks on colloquial text and as a tokenizer for less-known informal words.
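As a minimal usage sketch, the snippet below shows how a BERT-style encoder such as FarSSiBERT could be applied to this task with the HuggingFace transformers library: token embeddings are mean-pooled into sentence vectors, a pair of texts is scored by cosine similarity, and predictions are evaluated against gold annotations with the Pearson and Spearman coefficients named above. The checkpoint identifier, the pooling strategy, and the cosine scoring are illustrative assumptions rather than details confirmed by the paper.

```python
# Minimal sketch (assumed usage, not the authors' released code).
# "FarSSiBERT" is a placeholder checkpoint id; mean pooling and cosine
# similarity are common STS choices, not confirmed by the paper.
import torch
from scipy.stats import pearsonr, spearmanr
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "FarSSiBERT"  # hypothetical model identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(texts):
    """Mean-pool last hidden states into one vector per input text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def similarity(text_a, text_b):
    """Cosine similarity between the embeddings of two texts."""
    emb = embed([text_a, text_b])
    return torch.nn.functional.cosine_similarity(emb[0:1], emb[1:2]).item()

# Evaluation against human-annotated gold scores (e.g., from FarSSiM):
# preds = [similarity(a, b) for a, b in sentence_pairs]
# print(pearsonr(preds, gold_scores), spearmanr(preds, gold_scores))
```

Pearson measures linear agreement between predicted and gold similarity scores, while Spearman measures rank agreement, which is why both are commonly reported for semantic textual similarity benchmarks.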
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z) - Persian Slang Text Conversion to Formal and Deep Learning of Persian Short Texts on Social Media for Sentiment Classification
PSC, Persian Slang Converter, is a tool for converting conversational texts into formal ones.
Ten million unlabeled texts from various social networks and movie subtitles (as conversational texts) were used to train the unsupervised models.
60,000 texts from the comments of Instagram users, labeled as positive, negative, or neutral, serve as supervised data for training the sentiment classification model.
arXiv Detail & Related papers (2024-03-09T22:18:26Z) - FaBERT: Pre-training BERT on Persian Blogs
FaBERT is a Persian BERT-base model pre-trained on the HmBlogs corpus.
It addresses the intricacies of diverse sentence structures and linguistic styles prevalent in the Persian language.
arXiv Detail & Related papers (2024-02-09T18:50:51Z) - PersianLLaMA: Towards Building First Persian Large Language Model
This paper introduces the first large Persian language model, named PersianLLaMA, trained on a collection of Persian texts and datasets.
The results indicate that PersianLLaMA significantly outperforms its competitors in both understanding and generating Persian text.
arXiv Detail & Related papers (2023-12-25T12:48:55Z) - Detecting Text Formality: A Study of Text Classification Approaches
This work presents the first, to our knowledge, systematic study of formality detection methods based on statistical, neural, and Transformer-based machine learning approaches.
We conducted three types of experiments -- monolingual, multilingual, and cross-lingual.
The study shows that the Char BiLSTM model outperforms Transformer-based models on the monolingual and multilingual formality classification tasks.
arXiv Detail & Related papers (2022-04-19T16:23:07Z) - ViraPart: A Text Refinement Framework for ASR and NLP Tasks in Persian
We propose the ViraPart framework, which uses ParsBERT at its core for text clarification.
The proposed models for ZWNJ recognition, punctuation restoration, and Persian Ezafe construction achieve averaged macro F1 scores of 96.90%, 92.13%, and 98.50%, respectively.
arXiv Detail & Related papers (2021-10-18T08:20:40Z) - Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms plain-text pre-training while using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z) - Leveraging Pre-trained Language Model for Speech Sentiment Analysis
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo-label-based semi-supervised training strategy that uses a language model within an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z) - Sentiment analysis in tweets: an assessment study from classical to modern text representation models
Short texts published on Twitter have attracted significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z) - Evaluating Input Representation for Language Identification in Hindi-English Code Mixed Text
Code-mixed text comprises text written in more than one language.
People naturally tend to combine local language with global languages like English.
In this work, we focus on language identification in code-mixed sentences for Hindi-English mixed text.
arXiv Detail & Related papers (2020-11-23T08:08:09Z) - Grounded Compositional Outputs for Adaptive Language Modeling
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model whose size does not depend on the training vocabulary (a toy sketch of the idea is given below).
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
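To make the compositional idea concrete, here is a toy sketch, entirely an illustration of the general concept rather than the cited paper's implementation: each candidate word's output embedding is composed on the fly from its characters by a small recurrent encoder, so the output layer stores no vocabulary-sized embedding matrix.

```python
# Toy sketch of a compositional output embedding layer (an illustration
# of the general idea only, not the cited paper's architecture).
import torch
import torch.nn as nn

class CharComposedOutput(nn.Module):
    def __init__(self, n_chars=128, char_dim=32, hidden_dim=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # A small GRU composes character embeddings into one word vector.
        self.encoder = nn.GRU(char_dim, hidden_dim, batch_first=True)

    def word_vector(self, word):
        """Build an output embedding from the word's characters."""
        ids = torch.tensor([[min(ord(c), 127) for c in word]])
        _, h = self.encoder(self.char_emb(ids))
        return h.squeeze(0).squeeze(0)  # (hidden_dim,)

    def logits(self, hidden, candidates):
        """Score a decoder hidden state against composed word embeddings."""
        emb = torch.stack([self.word_vector(w) for w in candidates])
        return emb @ hidden  # one logit per candidate word

layer = CharComposedOutput()
state = torch.randn(256)  # stand-in for a language model's hidden state
print(layer.logits(state, ["cat", "cats", "catalog"]))
```

Because word vectors are composed rather than stored, the parameter count is independent of how many distinct words appear in training, which is the property this related paper highlights.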