ViraPart: A Text Refinement Framework for ASR and NLP Tasks in Persian
- URL: http://arxiv.org/abs/2110.09086v2
- Date: Tue, 19 Oct 2021 11:24:30 GMT
- Title: ViraPart: A Text Refinement Framework for ASR and NLP Tasks in Persian
- Authors: Narges Farokhshad, Milad Molazadeh, Saman Jamalabbasi, Hamed Babaei
Giglou, Saeed Bibak
- Abstract summary: We propose ViraPart, a framework that embeds ParsBERT at its core for text clarification.
The proposed models for ZWNJ recognition, punctuation restoration, and Persian Ezafe construction achieve averaged macro F1 scores of 96.90%, 92.13%, and 98.50%, respectively.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Persian is an inflectional SOV language, which makes it a highly
ambiguous language. However, techniques such as ZWNJ recognition, punctuation
restoration, and Persian Ezafe construction yield more understandable and
precise text. Most prior work on Persian addresses these techniques
individually; we believe, however, that all of these tasks are necessary for
Persian text refinement. In this work, we propose ViraPart, a framework that
embeds ParsBERT at its core for text clarification. First, we use the Persian
BERT variant followed by a classifier layer for each classification task.
Next, we combine the models' outputs to produce clear text. The proposed
models for ZWNJ recognition, punctuation restoration, and Persian Ezafe
construction achieve averaged macro F1 scores of 96.90%, 92.13%, and 98.50%,
respectively. Experimental results show that our proposed approach is very
effective in text refinement for the Persian language.
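As a rough illustration of the pattern described above (a Persian BERT encoder with a per-task classifier layer for token classification), the following minimal sketch assumes the HuggingFace transformers API and the public ParsBERT checkpoint; the binary ZWNJ label set and the untrained classification head are illustrative, not the authors' released models.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Public ParsBERT checkpoint; the head below is randomly initialized and
# would need fine-tuning on labeled data before its predictions mean anything.
MODEL_NAME = "HooshvareLab/bert-base-parsbert-uncased"
labels = ["NO_ZWNJ", "ZWNJ"]  # hypothetical tags: does a ZWNJ follow this token?

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=len(labels)
)

text = "می خواهم"  # the ZWNJ between the two word parts is missing
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1).squeeze(0)   # per-token label ids

for token, label_id in zip(
    tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), predictions.tolist()
):
    print(token, labels[label_id])

Punctuation restoration and Ezafe construction follow the same encoder-plus-classifier pattern with different label sets (e.g., the punctuation mark to insert after each token, or Ezafe present/absent).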
Related papers
- FarSSiBERT: A Novel Transformer-based Model for Semantic Similarity Measurement of Persian Social Networks Informal Texts [0.0]
This paper introduces a new transformer-based model to measure semantic similarity between Persian informal short texts from social networks.
It is pre-trained on approximately 104 million Persian informal short texts from social networks, making it one of a kind in the Persian language.
It has been demonstrated that our proposed model outperforms ParsBERT, LaBSE, and multilingual BERT on the Pearson and Spearman correlation criteria.
arXiv Detail & Related papers (2024-07-27T05:04:49Z)
- FaBERT: Pre-training BERT on Persian Blogs [13.566089841138938]
FaBERT is a Persian BERT-base model pre-trained on the HmBlogs corpus.
It addresses the intricacies of diverse sentence structures and linguistic styles prevalent in the Persian language.
arXiv Detail & Related papers (2024-02-09T18:50:51Z)
- PersianLLaMA: Towards Building First Persian Large Language Model [5.79461948374354]
This paper introduces the first large Persian language model, named PersianLLaMA, trained on a collection of Persian texts and datasets.
The results indicate that PersianLLaMA significantly outperforms its competitors in both understanding and generating Persian text.
arXiv Detail & Related papers (2023-12-25T12:48:55Z)
- Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT approach based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage.
arXiv Detail & Related papers (2023-07-27T10:59:18Z)
- Bag of Tricks for Effective Language Model Pretraining and Downstream Adaptation: A Case Study on GLUE [93.98660272309974]
This report briefly describes our submission Vega v1 on the General Language Understanding Evaluation leaderboard.
GLUE is a collection of nine natural language understanding tasks, including question answering, linguistic acceptability, sentiment analysis, text similarity, paraphrase detection, and natural language inference.
With our optimized pretraining and fine-tuning strategies, our 1.3-billion-parameter model sets a new state of the art on 4 of the 9 tasks, achieving the best average score of 91.3.
arXiv Detail & Related papers (2023-02-18T09:26:35Z)
- Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
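The summary above names the technique but not its mechanics; as a rough, hypothetical sketch of FST-based romanization in general (using the open-source Pynini FST toolkit rather than the paper's own library, and a tiny illustrative mapping rather than a full Persian scheme):

import pynini

# Character-level cross() mappings, unioned and closed under repetition,
# form a transducer that romanizes an input string one character at a time.
pairs = [("س", "s"), ("ا", "a"), ("ر", "r"), ("پ", "p")]
char_map = pynini.union(*(pynini.cross(src, dst) for src, dst in pairs))
romanize = char_map.closure()

word = "سارا"
lattice = pynini.shortestpath(word @ romanize)  # compose, keep the single best path
print(lattice.project("output").string())       # -> "sara"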
- Effidit: Your AI Writing Assistant [60.588370965898534]
Effidit is a digital writing assistant that uses artificial intelligence (AI) technologies to help users write higher-quality text more efficiently.
In Effidit, we significantly expand the capacities of a writing assistant by providing functions in five categories: text completion, error checking, text polishing, keywords to sentences (K2S), and cloud input methods (cloud IME).
arXiv Detail & Related papers (2022-08-03T02:24:45Z)
- Evaluating Persian Tokenizers [6.10917825357379]
This article introduces a novel study of the most widely used tokenizers for Persian.
It compares and evaluates their performance on Persian texts using a simple algorithm and a pre-tagged Persian dependency dataset.
After evaluating the tokenizers with the F1 score, the hybrid version of the Farsi Verb tokenizer and Hazm with bound-morpheme fixing showed the best performance, with an F1 score of 98.97%.
arXiv Detail & Related papers (2022-02-22T13:27:24Z)
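The entry above reports F1 scores without spelling out the metric; one common way to score a tokenizer against a pre-tagged reference is to compare token boundaries as character spans, as in this hypothetical sketch (not the article's exact algorithm):

# Treat each token as a (start, end) character span and compute F1 over
# the gold and predicted span sets.
def spans(tokens):
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out

def f1(gold_tokens, pred_tokens):
    gold, pred = spans(gold_tokens), spans(pred_tokens)
    tp = len(gold & pred)                      # spans both sides agree on
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

# Toy example with Latin stand-ins for Persian tokens:
print(f1(["mi", "khaham"], ["mikhaham"]))      # 0.0: boundaries disagree
print(f1(["mi", "khaham"], ["mi", "khaham"]))  # 1.0: exact match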
- The Challenges of Persian User-generated Textual Content: A Machine Learning-Based Approach [0.0]
This research applies machine learning-based approaches to tackle the hurdles that come with Persian user-generated textual content.
The presented approach uses machine-translated datasets to conduct sentiment analysis for the Persian language.
The results of the experiments show promising state-of-the-art performance compared to previous efforts.
arXiv Detail & Related papers (2021-01-20T11:57:59Z)
- TextHide: Tackling Data Privacy in Language Understanding Tasks [54.11691303032022]
TextHide mitigates privacy risks without slowing down training or reducing accuracy.
It requires all participants to add a simple encryption step to prevent an eavesdropping attacker from recovering private text data.
We evaluate TextHide on the GLUE benchmark, and our experiments show that TextHide can effectively defend attacks on shared gradients or representations.
arXiv Detail & Related papers (2020-10-12T22:22:15Z)
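The "simple encryption step" above is only named in this summary; TextHide builds on InstaHide-style mixing, roughly sketched below under stated assumptions (the shapes, the pool of other examples, and k are illustrative, not the paper's exact protocol):

import numpy as np

rng = np.random.default_rng(0)
d = 768                            # e.g., a BERT sentence-representation size
pool = rng.normal(size=(100, d))   # hypothetical representations of other examples

def hide(x, k=4):
    """Mix x with k-1 pool entries, then flip signs with a random +/-1 mask."""
    others = pool[rng.choice(len(pool), size=k - 1, replace=False)]
    coeffs = rng.dirichlet(np.ones(k))           # random weights summing to 1
    mixed = coeffs[0] * x + coeffs[1:] @ others  # convex combination
    sigma = rng.choice([-1.0, 1.0], size=d)      # random sign mask
    return sigma * mixed

hidden = hide(rng.normal(size=d))
print(hidden.shape)  # (768,): same shape, but hard to invert without the mask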
- Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)