FaBERT: Pre-training BERT on Persian Blogs
- URL: http://arxiv.org/abs/2402.06617v1
- Date: Fri, 9 Feb 2024 18:50:51 GMT
- Title: FaBERT: Pre-training BERT on Persian Blogs
- Authors: Mostafa Masumi, Seyed Soroush Majd, Mehrnoush Shamsfard, Hamid Beigy
- Abstract summary: FaBERT is a Persian BERT-base model pre-trained on the HmBlogs corpus.
It addresses the intricacies of diverse sentence structures and linguistic styles prevalent in the Persian language.
- Score: 13.566089841138938
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce FaBERT, a Persian BERT-base model pre-trained on the HmBlogs
corpus, encompassing both informal and formal Persian texts. FaBERT is designed
to excel in traditional Natural Language Understanding (NLU) tasks, addressing
the intricacies of diverse sentence structures and linguistic styles prevalent
in the Persian language. In our comprehensive evaluation of FaBERT on 12
datasets in various downstream tasks, encompassing Sentiment Analysis (SA),
Named Entity Recognition (NER), Natural Language Inference (NLI), Question
Answering (QA), and Question Paraphrasing (QP), it consistently demonstrated
improved performance, all achieved within a compact model size. The findings
highlight the importance of utilizing diverse and cleaned corpora, such as
HmBlogs, to enhance the performance of language models like BERT in Persian
Natural Language Processing (NLP) applications. FaBERT is openly accessible at
https://huggingface.co/sbunlp/fabert
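Since the model is published on the Hugging Face Hub, it can be loaded with the standard `transformers` Auto classes. The sketch below is illustrative: the model ID `sbunlp/fabert` comes from the link above, but the fill-mask demo around it is an assumed usage pattern, not an example from the paper.

```python
# Minimal sketch of loading FaBERT from the Hugging Face Hub.
# Only the model ID is taken from the paper; the fill-mask usage
# below is an assumed, generic BERT workflow.
MODEL_ID = "sbunlp/fabert"


def load_fabert():
    """Download the FaBERT tokenizer and masked-LM weights.

    `transformers` is imported lazily so the module can be inspected
    without the library (or network access) being available.
    """
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
    return tokenizer, model


if __name__ == "__main__":
    from transformers import pipeline

    tokenizer, model = load_fabert()
    # Any BERT-style checkpoint supports masked-token prediction
    # out of the box; here we probe it with a Persian sentence.
    fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
    sentence = f"تهران پایتخت {tokenizer.mask_token} است."
    for candidate in fill(sentence)[:3]:
        print(candidate["token_str"], round(candidate["score"], 4))
```

For downstream tasks such as NER or sentiment analysis, the same checkpoint would instead be loaded with `AutoModelForTokenClassification` or `AutoModelForSequenceClassification` and fine-tuned on the task dataset.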
Related papers
- FarSSiBERT: A Novel Transformer-based Model for Semantic Similarity Measurement of Persian Social Networks Informal Texts [0.0]
This paper introduces a new transformer-based model to measure semantic similarity between Persian informal short texts from social networks.
It is pre-trained on approximately 104 million Persian informal short texts from social networks, making it one of a kind in the Persian language.
It has been demonstrated that our proposed model outperforms ParsBERT, LaBSE, and multilingual BERT on the Pearson and Spearman correlation criteria.
arXiv Detail & Related papers (2024-07-27T05:04:49Z)
- Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words [50.11559460111882]
We explore the possibility of developing a BERT-style pretrained model over a vocabulary of words instead of wordpieces.
Results show that, compared to standard wordpiece-based BERT, WordBERT makes significant improvements on cloze test and machine reading comprehension.
Since the pipeline is language-independent, we train WordBERT for the Chinese language and obtain significant gains on five natural language understanding datasets.
arXiv Detail & Related papers (2022-02-24T15:15:48Z)
- ViraPart: A Text Refinement Framework for ASR and NLP Tasks in Persian [0.0]
We propose a ViraPart framework that uses embedded ParsBERT in its core for text clarifications.
In the end, the proposed model achieves averaged macro F1 scores of 96.90%, 92.13%, and 98.50% on ZWNJ recognition, punctuation restoration, and Persian Ezafe construction, respectively.
arXiv Detail & Related papers (2021-10-18T08:20:40Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms plain-text pre-training while using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address shortcomings in a model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z)
- ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes a monolingual BERT for the Persian language (ParsBERT).
It shows its state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores in all datasets, including existing ones as well as composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z)
- TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [113.29476656550342]
We present TaBERT, a pretrained LM that jointly learns representations for NL sentences and tables.
TaBERT is trained on a large corpus of 26 million tables and their English contexts.
Implementation of the model will be available at http://fburl.com/TaBERT.
arXiv Detail & Related papers (2020-05-17T17:26:40Z)
- BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts the Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z)
- Coreferential Reasoning Learning for Language Representation [88.14248323659267]
We present CorefBERT, a novel language representation model that can capture the coreferential relations in context.
The experimental results show that, compared with existing baseline models, CorefBERT can achieve significant improvements consistently on various downstream NLP tasks.
arXiv Detail & Related papers (2020-04-15T03:57:45Z)
- AraBERT: Transformer-based Model for Arabic Language Understanding [0.0]
We pre-trained BERT specifically for the Arabic language in the pursuit of achieving the same success that BERT did for the English language.
The results showed that the newly developed AraBERT achieved state-of-the-art performance on most tested Arabic NLP tasks.
arXiv Detail & Related papers (2020-02-28T22:59:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.